<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>LLM Training Masterclass | Build SOTA Models</title>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;600;700&family=Inconsolata:wght@400;600;700&family=JetBrains+Mono:wght@400;600;700&family=Source+Code+Pro:wght@400;600;700&display=swap" rel="stylesheet">
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        :root {
            --bg-dark: #0d1117;
            --bg-darker: #010409;
            --bg-card: #161b22;
            --border: #30363d;
            --text-primary: #c9d1d9;
            --text-secondary: #8b949e;
            --accent: #58a6ff;
            --accent-2: #1f6feb;
            --green: #3fb950;
            --yellow: #f0883e;
            --red: #f85149;
            --purple: #bc8cff;
        }

        body {
            font-family: 'JetBrains Mono', monospace;
            background: var(--bg-darker);
            color: var(--text-primary);
            line-height: 1.7;
            overflow-x: hidden;
        }

        .container {
            max-width: 1400px;
            margin: 0 auto;
            padding: 20px;
        }

        header {
            background: var(--bg-dark);
            border-bottom: 2px solid var(--accent);
            padding: 30px 0;
            position: sticky;
            top: 0;
            z-index: 100;
            box-shadow: 0 4px 20px rgba(0, 0, 0, 0.5);
        }

        h1 {
            font-size: 2.5rem;
            color: var(--accent);
            font-weight: 700;
            text-transform: uppercase;
            letter-spacing: 2px;
        }

        .tabs {
            display: flex;
            gap: 10px;
            margin: 30px 0;
            overflow-x: auto;
            padding-bottom: 10px;
        }

        .tab {
            padding: 12px 24px;
            background: var(--bg-card);
            border: 2px solid var(--border);
            color: var(--text-secondary);
            cursor: pointer;
            font-size: 0.9rem;
            font-weight: 600;
            text-transform: uppercase;
            letter-spacing: 1px;
            transition: all 0.3s;
        }

        .tab:hover {
            border-color: var(--accent);
            color: var(--accent);
        }

        .tab.active {
            background: var(--accent-2);
            border-color: var(--accent);
            color: white;
        }

        .content-section {
            display: none;
        }

        .content-section.active {
            display: block;
            animation: fadeIn 0.4s;
        }

        @keyframes fadeIn {
            from { opacity: 0; transform: translateY(10px); }
            to { opacity: 1; transform: translateY(0); }
        }

        .module {
            background: var(--bg-card);
            border: 2px solid var(--border);
            border-left: 4px solid var(--accent);
            padding: 30px;
            margin: 30px 0;
            transition: all 0.3s;
        }

        .module:hover {
            border-left-color: var(--green);
            box-shadow: 0 8px 30px rgba(88, 166, 255, 0.1);
        }

        .module h2 {
            color: var(--accent);
            font-size: 1.8rem;
            margin-bottom: 20px;
            font-weight: 700;
            text-transform: uppercase;
        }

        .module h3 {
            color: var(--purple);
            font-size: 1.3rem;
            margin: 25px 0 15px 0;
            font-weight: 600;
        }

        .module ul {
            list-style: none;
            padding-left: 0;
        }

        .module li {
            padding: 10px 0;
            padding-left: 30px;
            position: relative;
            border-bottom: 1px solid var(--border);
        }

        .module li:before {
            content: ">";
            position: absolute;
            left: 0;
            color: var(--green);
            font-weight: 700;
        }

        .code-block {
            background: var(--bg-darker);
            border: 2px solid var(--border);
            border-left: 4px solid var(--green);
            padding: 20px;
            margin: 20px 0;
            overflow-x: auto;
            font-family: 'IBM Plex Mono', monospace;
            font-size: 0.85rem;
            line-height: 1.6;
        }

        .code-block pre {
            margin: 0;
            color: var(--text-primary);
        }

        .tip-box {
            background: rgba(88, 166, 255, 0.1);
            border-left: 4px solid var(--accent);
            padding: 20px;
            margin: 20px 0;
        }

        .tip-box strong {
            color: var(--accent);
        }

        .warning-box {
            background: rgba(248, 81, 73, 0.1);
            border-left: 4px solid var(--red);
            padding: 20px;
            margin: 20px 0;
        }

        .warning-box strong {
            color: var(--red);
        }

        table {
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
        }

        th, td {
            padding: 15px;
            text-align: left;
            border-bottom: 2px solid var(--border);
        }

        th {
            background: var(--bg-darker);
            color: var(--accent);
            font-weight: 700;
            text-transform: uppercase;
        }

        tr:hover {
            background: rgba(88, 166, 255, 0.05);
        }

        a {
            color: var(--accent);
            text-decoration: none;
            border-bottom: 2px solid transparent;
            transition: all 0.3s;
        }

        a:hover {
            border-bottom-color: var(--accent);
        }

        .resource-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
            gap: 20px;
            margin: 25px 0;
        }

        .resource-card {
            background: var(--bg-darker);
            border: 2px solid var(--border);
            padding: 25px;
            transition: all 0.3s;
        }

        .resource-card:hover {
            border-color: var(--purple);
            transform: translateY(-5px);
        }

        .resource-card h4 {
            color: var(--purple);
            margin-bottom: 10px;
            font-size: 1.2rem;
        }

        .paper-item {
            background: var(--bg-darker);
            border-left: 4px solid var(--purple);
            padding: 20px;
            margin: 15px 0;
        }

        .paper-title {
            color: var(--purple);
            font-size: 1.1rem;
            font-weight: 700;
            margin-bottom: 10px;
        }

        footer {
            margin-top: 80px;
            padding: 40px 20px;
            border-top: 2px solid var(--border);
            text-align: center;
        }

        @media (max-width: 768px) {
            h1 { font-size: 1.8rem; }
            .resource-grid { grid-template-columns: 1fr; }
        }
    </style>
</head>
<body>
    <header>
        <div class="container">
            <h1>[ LLM TRAINING MASTERCLASS ]</h1>
            <p style="color: var(--text-secondary); margin-top: 10px;">Build State-of-the-Art Language Models | From Theory to Production</p>
        </div>
    </header>

    <div class="container">
        <div class="tabs">
            <button class="tab active" data-tab="intro">INTRO</button>
            <button class="tab" data-tab="arch">ARCHITECTURE</button>
            <button class="tab" data-tab="data">DATASETS</button>
            <button class="tab" data-tab="train">TRAINING</button>
            <button class="tab" data-tab="opt">OPTIMIZATION</button>
            <button class="tab" data-tab="eval">EVALUATION</button>
            <button class="tab" data-tab="papers">PAPERS</button>
            <button class="tab" data-tab="impl">IMPLEMENTATION</button>
            <button class="tab" data-tab="deploy">DEPLOYMENT</button>
        </div>

        <!-- INTRO -->
        <div class="content-section active" id="intro">
            <div class="module">
                <h2>[ SYSTEM OVERVIEW ]</h2>
                <p>this is your complete guide to building state-of-the-art large language models. from foundational transformer architecture to deployment at scale. everything you need to go from research papers to production models.</p>
                
                <h3>what you'll master</h3>
                <ul>
                    <li>transformer architecture internals (attention, ffn, normalization)</li>
                    <li>dataset curation and preprocessing at scale</li>
                    <li>pre-training, instruction tuning, RLHF, and DPO</li>
                    <li>scaling laws and compute-optimal training (chinchilla)</li>
                    <li>distributed training (FSDP, DeepSpeed ZeRO)</li>
                    <li>mixed precision, quantization, LoRA</li>
                    <li>evaluation frameworks and benchmarks</li>
                    <li>production deployment strategies</li>
                </ul>

                <h3>key resources</h3>
                <ul>
                    <li><a href="https://huggingface.co/docs/transformers" target="_blank">HuggingFace Transformers Docs</a></li>
                    <li><a href="https://arxiv.org/list/cs.CL/recent" target="_blank">arXiv cs.CL (Computation and Language)</a></li>
                    <li><a href="https://github.com/Hannibal046/Awesome-LLM" target="_blank">Awesome-LLM GitHub</a></li>
                    <li><a href="https://www.deepspeed.ai/" target="_blank">DeepSpeed Documentation</a></li>
                </ul>

                <div class="tip-box">
                    <strong>RECOMMENDED PATH:</strong> follow modules in order. each builds on previous concepts. start small (1-7B params), reproduce papers, then scale up. hands-on practice > theory.
                </div>
            </div>
        </div>

        <!-- ARCHITECTURE -->
        <div class="content-section" id="arch">
            <div class="module">
                <h2>[ TRANSFORMER ARCHITECTURE ]</h2>
                
                <h3>core components</h3>
                <ul>
                    <li><strong>self-attention:</strong> QKV projections, scaled dot-product, softmax over context</li>
                    <li><strong>multi-head attention:</strong> parallel attention with different learned projections</li>
                    <li><strong>feed-forward network:</strong> position-wise FFN, typically 4x hidden dim (about 2.7x with SwiGLU, as in LLaMA), widely thought to hold much of the model's factual knowledge</li>
                    <li><strong>layer normalization:</strong> pre-norm vs post-norm, RMSNorm for efficiency</li>
                    <li><strong>residual connections:</strong> enable deep networks, gradient flow</li>
                </ul>

                <div class="code-block">
                    <pre># simplified scaled dot-product attention
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask broadcasts to (..., seq_len, seq_len), 0 = blocked
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # use the dtype's own minimum so the fill value also fits in fp16/bf16
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)</pre>
                </div>
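                <p>the component list above mentions RMSNorm. a minimal sketch of it, assuming the usual formulation (divide by the root-mean-square of the features, then apply a learned gain); the eps value and class name are illustrative choices, not taken from any specific model:</p>
                <div class="code-block">
                    <pre>import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps  # assumed value, guards against division by zero
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x):
        # scale by the root-mean-square over the feature dim (no mean subtraction, no bias)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 4, 8)
y = RMSNorm(8)(x)  # same shape as x, each feature vector has mean square close to 1</pre>
                </div>
                <p>compared with LayerNorm this skips the mean computation and the bias term, which is where the efficiency gain comes from.</p>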

                <h3>modern architectures</h3>
                <div class="resource-grid">
                    <div class="resource-card">
                        <h4>GPT</h4>
                        <p>decoder-only, causal attention, learned positional embeddings</p>
                        <a href="https://arxiv.org/abs/2005.14165" target="_blank">GPT-3 paper →</a>
                    </div>
                    <div class="resource-card">
                        <h4>LLaMA</h4>
                        <p>RMSNorm, SwiGLU activation, RoPE positional embeddings</p>
                        <a href="https://arxiv.org/abs/2302.13971" target="_blank">LLaMA paper →</a>
                    </div>
                    <div class="resource-card">
                        <h4>Mistral</h4>
                        <p>sliding window attention, grouped-query attention (GQA)</p>
                        <a href="https://arxiv.org/abs/2310.06825" target="_blank">Mistral paper →</a>
                    </div>
                </div>

                <h3>attention optimizations</h3>
                <ul>
                    <li><strong>Flash Attention:</strong> IO-aware algorithm, 2-4x speedup, exact attention</li>
                    <li><strong>Multi-Query Attention (MQA):</strong> shared K/V across heads, faster inference</li>
                    <li><strong>Grouped-Query Attention (GQA):</strong> balance between MHA and MQA</li>
                    <li><strong>Sliding Window:</strong> local attention for long contexts</li>
                </ul>
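                <p>to make the sliding window idea concrete, here is a small sketch that builds a causal sliding-window mask. the function name and the window size are hypothetical; real implementations usually fuse this into the attention kernel instead of materializing the mask:</p>
                <div class="code-block">
                    <pre>import torch

def sliding_window_causal_mask(seq_len, window):
    idx = torch.arange(seq_len)
    # offset[i, j] = i - j: how far key j lies behind query i
    offset = idx[:, None] - idx[None, :]
    # keep keys that are not in the future and fall within the last `window` positions
    return torch.logical_and(offset.ge(0), offset.lt(window))

mask = sliding_window_causal_mask(seq_len=5, window=3)
print(mask.int())</pre>
                </div>
                <p>true entries mark positions a query may attend to, so the result can be passed as the mask argument of the attention function earlier in this section.</p>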

                <div class="tip-box">
                    <strong>KEY INSIGHT:</strong> attention mixes information across tokens (each position looks at all previous ones), while the FFN transforms each position independently and is widely believed to hold much of the model's stored knowledge. the combination is what makes transformers powerful.
                </div>
            </div>
        </div>

        <!-- DATASETS -->
        <div class="content-section" id="data">
            <div class="module">
                <h2>[ DATASET ENGINEERING ]</h2>
                
                <h3>pre-training datasets</h3>
                <table>
                    <tr>
                        <th>Dataset</th>
                        <th>Size</th>
                        <th>Type</th>
                        <th>Link</th>
                    </tr>
                    <tr>
                        <td>The Pile</td>
                        <td>825 GB</td>
                        <td>diverse text corpus</td>
                        <td><a href="https://huggingface.co/datasets/EleutherAI/pile" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>RedPajama</td>
                        <td>1.2T tokens</td>
                        <td>LLaMA training data replica</td>
                        <td><a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>RefinedWeb</td>
                        <td>5T tokens (600B public)</td>
                        <td>high-quality web data</td>
                        <td><a href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>FineWeb</td>
                        <td>15T tokens</td>
                        <td>deduplicated web corpus</td>
                        <td><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>The Stack</td>
                        <td>6 TB</td>
                        <td>code (permissive licenses)</td>
                        <td><a href="https://huggingface.co/datasets/bigcode/the-stack" target="_blank">view</a></td>
                    </tr>
                </table>

                <h3>instruction & chat datasets</h3>
                <table>
                    <tr>
                        <th>Dataset</th>
                        <th>Size</th>
                        <th>Purpose</th>
                    </tr>
                    <tr>
                        <td>OpenAssistant</td>
                        <td>161K messages</td>
                        <td>multi-turn conversations</td>
                    </tr>
                    <tr>
                        <td>UltraChat</td>
                        <td>1.5M dialogues</td>
                        <td>diverse dialogues</td>
                    </tr>
                    <tr>
                        <td>Dolly 15K</td>
                        <td>15K instruction-response pairs</td>
                        <td>human-generated instructions</td>
                    </tr>
                </table>

                <h3>data processing pipeline</h3>
                <ul>
                    <li><strong>deduplication:</strong> exact and fuzzy matching (MinHash, SimHash)</li>
                    <li><strong>quality filtering:</strong> perplexity scores, length filters, language detection</li>
                    <li><strong>PII removal:</strong> regex patterns, NER models, privacy protection</li>
                    <li><strong>toxicity filtering:</strong> perspective API, custom classifiers</li>
                    <li><strong>format standardization:</strong> consistent tokenization, special tokens</li>
                </ul>
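                <p>fuzzy deduplication with MinHash can be sketched in pure python. this is a toy version under stated assumptions: word 3-gram shingles, 64 salted blake2b hashes standing in for independent hash functions, and made-up example documents. production pipelines add LSH banding so they never compare all pairs:</p>
                <div class="code-block">
                    <pre>import hashlib

def shingles(text, n=3):
    # overlapping word n-grams; very short texts fall back to a single shingle
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(seed, shingle):
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(text, num_hashes=64):
    items = shingles(text)
    # one minimum per salted hash function
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # the fraction of matching slots estimates the Jaccard similarity of the shingle sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river"
doc_c = "gradient clipping keeps the update norm bounded during training"
sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))</pre>
                </div>
                <p>near-duplicates score close to 1 and unrelated documents close to 0; a typical pipeline drops one document from every pair above a chosen threshold.</p>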

                <div class="warning-box">
                    <strong>DATA QUALITY MATTERS:</strong> garbage in = garbage out. invest heavily in data curation; data quality often matters more than architecture tweaks. chinchilla showed that many large models were undertrained: they saw too few tokens for their size, so compute-optimal training needs far more high-quality data.
                </div>

                <h3>tokenization</h3>
                <div class="code-block">
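                    <pre># train a byte-level BPE tokenizer with HuggingFace tokenizers
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()  # map byte-level tokens back to text on decode

trainer = trainers.BpeTrainer(
    vocab_size=50000,
    special_tokens=["&lt;pad&gt;", "&lt;s&gt;", "&lt;/s&gt;", "&lt;unk&gt;", "&lt;mask&gt;"],
    # seed the vocab with all 256 byte symbols so no input is out-of-vocabulary
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")</pre>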
                </div>
            </div>
        </div>

        <!-- TRAINING -->
        <div class="content-section" id="train">
            <div class="module">
                <h2>[ TRAINING METHODOLOGIES ]</h2>
                
                <h3>pre-training from scratch</h3>
                <ul>
                    <li><strong>initialization:</strong> scaled random init, He/Xavier initialization</li>
                    <li><strong>warmup:</strong> linear warmup for 2K-10K steps prevents instability</li>
                    <li><strong>learning rate schedule:</strong> cosine decay with warmup</li>
                    <li><strong>batch size:</strong> large batches (2M-4M tokens) for stability</li>
                    <li><strong>gradient clipping:</strong> max norm 1.0 to prevent explosions</li>
                </ul>
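                <p>the warmup and cosine decay bullets above combine into one schedule. a minimal sketch, assuming a floor of 10% of the peak learning rate (the min_lr default is an assumption, not something this guide prescribes):</p>
                <div class="code-block">
                    <pre>import math

def lr_at(step, max_lr=3e-4, warmup_steps=2000, max_steps=500_000, min_lr=3e-5):
    if step >= max_steps:
        return min_lr
    if step >= warmup_steps:
        # cosine decay from max_lr down to min_lr over the remaining steps
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_lr + (max_lr - min_lr) * cosine
    # linear warmup from 0 to max_lr
    return max_lr * step / warmup_steps

for step in (0, 1000, 2000, 250_000, 500_000):
    print(step, f"{lr_at(step):.2e}")</pre>
                </div>
                <p>in PyTorch a function like this can be wrapped in torch.optim.lr_scheduler.LambdaLR by returning the ratio lr_at(step) / max_lr.</p>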

                <h3>scaling laws (chinchilla)</h3>
                <ul>
                    <li>compute-optimal: ~20 tokens per parameter</li>
                    <li>doubling compute → scale model size AND training tokens by the same factor (~1.4x each)</li>
                    <li>most models are undertrained (too few tokens)</li>
                    <li>70B model needs ~1.4T tokens for optimal training</li>
                </ul>
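                <p>the arithmetic behind these bullets, as a rough sketch. it assumes the 20 tokens-per-parameter rule of thumb from the list and the common approximation of about 6 FLOPs per parameter per token for training compute; both are approximations, not exact fits from the paper:</p>
                <div class="code-block">
                    <pre>import math

TOKENS_PER_PARAM = 20  # rule of thumb from the list above

def optimal_tokens(n_params):
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    # assumed approximation: ~6 FLOPs per parameter per token (forward + backward)
    return 6 * n_params * n_tokens

def optimal_params_for_compute(flops):
    # solve flops = 6 * N * (20 * N) for N
    return math.sqrt(flops / (6 * TOKENS_PER_PARAM))

n = 70e9
d = optimal_tokens(n)                          # 1.4e12 tokens
c = training_flops(n, d)                       # about 5.9e23 FLOPs
ratio = optimal_params_for_compute(2 * c) / n  # about 1.41: 2x compute, sqrt(2)x params
print(f"{d:.2e} tokens, {c:.2e} FLOPs, params grow {ratio:.2f}x per compute doubling")</pre>
                </div>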

                <div class="code-block">
                    <pre># typical pre-training config
{
    "hidden_size": 4096,
    "num_layers": 32,
    "num_heads": 32,
    "intermediate_size": 11008,
    "vocab_size": 50000,
    "max_position_embeddings": 4096,
    "learning_rate": 3e-4,
    "warmup_steps": 2000,
    "max_steps": 500000,
    "batch_size": 256,
    "gradient_accumulation_steps": 16,
    "weight_decay": 0.1,
    "bf16": true
}</pre>
                </div>
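                <p>a quick sanity check on what size of model this config describes. the estimate below assumes a LLaMA-style block (four attention projections, a three-matrix gated FFN) and untied input/output embeddings, and ignores the small normalization weights; none of that is stated in the config itself:</p>
                <div class="code-block">
                    <pre>def estimate_params(hidden_size, num_layers, intermediate_size, vocab_size, tied_embeddings=False):
    attention = 4 * hidden_size * hidden_size   # Q, K, V, and output projections
    ffn = 3 * hidden_size * intermediate_size   # gate, up, down (gated FFN; assumption)
    embeddings = vocab_size * hidden_size
    if not tied_embeddings:
        embeddings *= 2                         # separate input and output matrices
    return num_layers * (attention + ffn) + embeddings

total = estimate_params(hidden_size=4096, num_layers=32, intermediate_size=11008, vocab_size=50000)
print(f"{total / 1e9:.2f}B parameters")  # 6.89B parameters</pre>
                </div>
                <p>so under these assumptions the config is a roughly 7B-parameter model.</p>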

                <h3>instruction tuning (SFT)</h3>
                <ul>
                    <li>convert base model to instruction-following</li>
                    <li>much lower LR than pre-training (1e-5 to 5e-5)</li>
                    <li>2-3 epochs typically sufficient</li>
                    <li>50K-200K high-quality examples</li>
                </ul>

                <h3>RLHF pipeline</h3>
                <ul>
                    <li><strong>step 1:</strong> train reward model on preference pairs</li>
                    <li><strong>step 2:</strong> use PPO to optimize policy against reward</li>
                    <li><strong>step 3:</strong> KL penalty to prevent drift from SFT model</li>
                    <li>complex, requires careful tuning, prone to reward hacking</li>
                </ul>
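                <p>two pieces of this pipeline fit in a few lines. the sketch below assumes the standard pairwise (Bradley-Terry) loss for the reward model in step 1 and a per-sample KL-shaped reward for step 3; function names and the beta value are illustrative:</p>
                <div class="code-block">
                    <pre>import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    # push the score of the preferred response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def kl_shaped_reward(reward, policy_logprob, ref_logprob, beta=0.1):
    # penalize the policy for assigning higher log-prob than the SFT reference does
    return reward - beta * (policy_logprob - ref_logprob)

chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.1, 0.5])
print(reward_model_loss(chosen, rejected))
print(kl_shaped_reward(torch.tensor(1.0), torch.tensor(-1.0), torch.tensor(-2.0)))</pre>
                </div>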

                <h3>DPO (simpler alternative)</h3>
                <ul>
                    <li>train directly on preference pairs, no reward model needed</li>
                    <li>more stable than PPO, easier to implement</li>
                    <li>better for smaller teams and limited compute</li>
                </ul>

                <div class="code-block">
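                    <pre># DPO training with TRL
# assumes model, ref_model, tokenizer, and preference_dataset are already loaded;
# preference_dataset needs "prompt", "chosen", and "rejected" columns
from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="dpo-output",  # placeholder path
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL penalty toward ref_model
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,  # newer TRL releases take this as processing_class=tokenizer
)

trainer.train()</pre>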
                </div>
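                <p>for intuition about what the trainer optimizes, the DPO objective itself is short. a sketch, assuming the inputs are summed log-probabilities of each full response under the policy and the frozen reference model (argument names are illustrative):</p>
                <div class="code-block">
                    <pre>import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # how much more the policy prefers chosen over rejected than the reference does
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# at initialization the policy equals the reference, so the loss is log(2)
zeros = torch.zeros(4)
print(dpo_loss(zeros, zeros, zeros, zeros))</pre>
                </div>
                <p>the loss falls as the policy widens the chosen-vs-rejected margin relative to the reference, and beta controls how strongly deviations from the reference are rewarded or penalized.</p>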

                <div class="tip-box">
                    <strong>TRAINING PHASES:</strong> pre-training (learn language) → instruction tuning (learn to follow instructions) → alignment (learn human preferences). each phase serves a distinct purpose.
                </div>
            </div>
        </div>

        <!-- OPTIMIZATION -->
        <div class="content-section" id="opt">
            <div class="module">
                <h2>[ OPTIMIZATION & EFFICIENCY ]</h2>
                
                <h3>mixed precision training</h3>
                <ul>
                    <li><strong>BF16:</strong> best for training, stable, same range as FP32</li>
                    <li><strong>FP16:</strong> needs loss scaling, but widely supported</li>
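                    <li>half-precision activations and matmuls: roughly 2x less activation memory and often 2-3x speedup on tensor-core GPUs (FP32 master weights and optimizer states stay full size)</li>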
                    <li>use torch.amp for automatic mixed precision</li>
                </ul>
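                <p>a minimal training step with torch.autocast in BF16. this is a sketch on a toy linear model with random data; it falls back to CPU when no GPU is present, and with FP16 you would additionally need a gradient scaler:</p>
                <div class="code-block">
                    <pre>import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)  # toy stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)

# matmuls inside the context run in bfloat16; the parameters themselves stay float32
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)
    loss = nn.functional.mse_loss(out.float(), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
print(out.dtype, model.weight.dtype)</pre>
                </div>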

                <h3>gradient checkpointing</h3>
                <ul>
                    <li>trade compute for memory (recompute activations in backward)</li>
                    <li>~30-40% slower but uses ~50% less memory</li>
                    <li>essential for training large models on limited VRAM</li>
                    <li>enables longer sequences and larger batch sizes</li>
                </ul>
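                <p>in PyTorch the trade-off above is available through torch.utils.checkpoint. a small sketch on a toy block (the layer sizes are arbitrary); in a transformer you would typically wrap each decoder layer this way:</p>
                <div class="code-block">
                    <pre>import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 32))
x = torch.randn(4, 32, requires_grad=True)

# activations inside `block` are not stored; they are recomputed during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)</pre>
                </div>
                <p>gradients are identical to the non-checkpointed run; only the memory and compute profile changes.</p>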

                <h3>distributed training</h3>
                <ul>
                    <li><strong>DDP:</strong> data parallel, replicate model across GPUs</li>
                    <li><strong>FSDP:</strong> fully sharded, split params/grads/optimizer across GPUs</li>
                    <li><strong>DeepSpeed ZeRO:</strong> stages are cumulative: stage 1 shards optimizer states, stage 2 also shards gradients, stage 3 also shards parameters</li>
                    <li><strong>pipeline parallelism:</strong> split layers across GPUs</li>
                    <li><strong>tensor parallelism:</strong> split individual layers across GPUs</li>
                </ul>
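                <p>a back-of-envelope view of what each ZeRO stage saves. this assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for FP32 master weights plus two Adam moments) and ignores activations and buffers; the 7.5B / 64-GPU setting is just an example:</p>
                <div class="code-block">
                    <pre>def zero_bytes_per_gpu(n_params, n_gpus, stage):
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus   # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus   # stage 2: also shard gradients
    if stage >= 3:
        params /= n_gpus  # stage 3: also shard parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")</pre>
                </div>
                <p>stage 0 here is plain data parallelism (DDP), where every GPU holds the full model state. FSDP with full sharding corresponds roughly to stage 3.</p>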

                <div class="code-block">
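                    <pre># FSDP training
# assumes torch.distributed is initialized (e.g. launched with torchrun) and model is built
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # enum value, not a string
    device_id=torch.cuda.current_device(),
)

# train normally - FSDP handles sharding automatically</pre>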
                </div>

                <h3>LoRA (parameter-efficient fine-tuning)</h3>
                <ul>
                    <li>learn low-rank updates to weight matrices</li>
                    <li>train only 0.1-1% of parameters</li>
                    <li>much less memory, faster training</li>
                    <li>merge adapters back into model after training</li>
                </ul>
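                <p>what "low-rank updates" means in code: a from-scratch sketch of a LoRA-wrapped linear layer. the class name is hypothetical and the init follows the common convention (random A, zero B) so the wrapped layer starts out identical to the base layer; library implementations add dropout, merging, and more:</p>
                <div class="code-block">
                    <pre>import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x):
        # base output plus the scaled low-rank update (B @ A) applied to x
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} of {total}")</pre>
                </div>
                <p>merging after training amounts to adding scale * (B @ A) into the frozen weight, which is why inference has no extra cost.</p>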

                <div class="code-block">
                    <pre># LoRA with PEFT
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of params are trainable</pre>
                </div>

                <h3>quantization</h3>
                <ul>
                    <li><strong>8-bit:</strong> bitsandbytes, minimal quality loss</li>
                    <li><strong>4-bit:</strong> QLoRA, GPTQ for extreme compression</li>
                    <li>70B at 4-bit is ~35 GB of weights: fits on one 48 GB GPU or across two 24 GB consumer GPUs</li>
                    <li>inference speedup + memory reduction</li>
                </ul>
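                <p>the core idea behind weight quantization, shown with the simplest possible scheme: per-tensor absmax rounding to int8. this is a teaching sketch only; bitsandbytes, GPTQ, and QLoRA use more refined methods (finer-grained scales, outlier handling, error compensation):</p>
                <div class="code-block">
                    <pre>import torch

def quantize_absmax_int8(w):
    # map the largest magnitude to 127 and round everything else onto the int8 grid
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, scale = quantize_absmax_int8(w)
max_error = (w - dequantize(q, scale)).abs().max()
print(f"scale={scale.item():.5f}, max abs error={max_error.item():.5f}")</pre>
                </div>
                <p>storage drops from 4 bytes to 1 byte per weight, and the rounding error per weight is bounded by half the scale.</p>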

                <div class="warning-box">
                    <strong>OOM ERRORS?</strong> try in order: (1) reduce batch size, (2) enable gradient checkpointing, (3) use FSDP/ZeRO-3, (4) use LoRA, (5) reduce model size or sequence length.
                </div>
            </div>
        </div>

        <!-- EVALUATION -->
        <div class="content-section" id="eval">
            <div class="module">
                <h2>[ EVALUATION & BENCHMARKS ]</h2>
                
                <h3>key benchmarks</h3>
                <table>
                    <tr>
                        <th>Benchmark</th>
                        <th>Measures</th>
                        <th>Link</th>
                    </tr>
                    <tr>
                        <td>MMLU</td>
                        <td>multitask accuracy (57 subjects)</td>
                        <td><a href="https://huggingface.co/datasets/cais/mmlu" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>HellaSwag</td>
                        <td>commonsense reasoning</td>
                        <td><a href="https://huggingface.co/datasets/Rowan/hellaswag" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>TruthfulQA</td>
                        <td>truthfulness, avoiding misconceptions</td>
                        <td><a href="https://huggingface.co/datasets/truthful_qa" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>GSM8K</td>
                        <td>grade school math reasoning</td>
                        <td><a href="https://huggingface.co/datasets/gsm8k" target="_blank">view</a></td>
                    </tr>
                    <tr>
                        <td>HumanEval</td>
                        <td>code generation (python)</td>
                        <td><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">view</a></td>
                    </tr>
                </table>

                <h3>leaderboards</h3>
                <ul>
                    <li><a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">Open LLM Leaderboard</a> - comprehensive benchmark suite</li>
                    <li><a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" target="_blank">Chatbot Arena</a> - Elo ratings from human preferences</li>
                    <li><a href="https://tatsu-lab.github.io/alpaca_eval/" target="_blank">AlpacaEval</a> - instruction-following evaluation</li>
                </ul>

                <h3>running evaluations</h3>
                <div class="code-block">
                    <pre># lm-eval-harness
pip install lm-eval

lm_eval --model hf \
    --model_args pretrained=your-model \
    --tasks mmlu,hellaswag,gsm8k,humaneval \
    --device cuda \
    --batch_size 8 \
    --output_path results/</pre>
                </div>

                <h3>metrics to track</h3>
                <ul>
                    <li><strong>perplexity:</strong> exp(cross-entropy loss), lower = better</li>
                    <li><strong>accuracy:</strong> exact match on multiple choice / classification</li>
                    <li><strong>pass@k:</strong> code correctness (k samples)</li>
                    <li><strong>human eval:</strong> side-by-side comparisons, Elo ratings</li>
                </ul>
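                <p>two of these metrics are easy to compute by hand. the sketch below assumes per-token negative log-likelihoods in nats for perplexity, and for pass@k the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed the tests:</p>
                <div class="code-block">
                    <pre>import math

def perplexity(token_nlls):
    # exp of the mean per-token negative log-likelihood
    return math.exp(sum(token_nlls) / len(token_nlls))

def pass_at_k(n, c, k):
    # probability that at least one of k samples (out of n drawn, c correct) passes
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(perplexity([math.log(2)] * 4))  # a model that is always 50/50 between two tokens
print(pass_at_k(n=10, c=5, k=1))
print(pass_at_k(n=10, c=5, k=5))</pre>
                </div>
                <p>sampling n larger than k and using the estimator gives a lower-variance pass@k than drawing exactly k samples per problem.</p>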

                <div class="tip-box">
                    <strong>EVALUATION BEST PRACTICES:</strong> use multiple benchmarks, watch for contamination, prioritize human eval for alignment tasks, track both capability and safety metrics.
                </div>
            </div>
        </div>

        <!-- PAPERS -->
        <div class="content-section" id="papers">
            <div class="module">
                <h2>[ ESSENTIAL RESEARCH PAPERS ]</h2>
                
                <h3>foundational papers</h3>
                
                <div class="paper-item">
                    <div class="paper-title">Attention Is All You Need (2017)</div>
                    <p>Vaswani et al., Google</p>
                    <p>introduced transformer architecture. self-attention, multi-head attention, positional encodings.</p>
                    <a href="https://arxiv.org/abs/1706.03762" target="_blank">read paper →</a>
                </div>

                <div class="paper-item">
                    <div class="paper-title">GPT-3: Language Models are Few-Shot Learners (2020)</div>
                    <p>Brown et al., OpenAI</p>
                    <p>demonstrated few-shot in-context learning at 175B parameters, with performance improving steadily as models scale.</p>
                    <a href="https://arxiv.org/abs/2005.14165" target="_blank">read paper →</a>
                </div>

                <div class="paper-item">
                    <div class="paper-title">Training Compute-Optimal Large Language Models (2022)</div>
                    <p>Hoffmann et al., DeepMind</p>
                    <p>chinchilla scaling laws. most LLMs undertrained. ~20 tokens per parameter optimal.</p>
                    <a href="https://arxiv.org/abs/2203.15556" target="_blank">read paper →</a>
                </div>

                <h3>architecture & optimization</h3>
                
                <div class="paper-item">
                    <div class="paper-title">Flash Attention (2022)</div>
                    <p>Dao et al., Stanford</p>
                    <p>IO-aware attention algorithm. 2-4x speedup with no approximation.</p>
                    <a href="https://arxiv.org/abs/2205.14135" target="_blank">read paper →</a>
                </div>

                <div class="paper-item">
                    <div class="paper-title">LLaMA: Open and Efficient Foundation Language Models (2023)</div>
                    <p>Touvron et al., Meta</p>
                    <p>RMSNorm, SwiGLU, RoPE. strong performance at 7B-65B scale.</p>
                    <a href="https://arxiv.org/abs/2302.13971" target="_blank">read paper →</a>
                </div>
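                <p>a plain-python sketch of RMSNorm, one of the components listed above. it normalizes by the root mean square of the activations and applies a learned per-feature gain, with no mean subtraction. the function name and the <code>eps</code> default are assumptions for illustration. real implementations operate on tensors.</p>
                <div class="code-block">
                    <pre>
```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / rms(x), then multiply by a learned per-feature weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# hypothetical 2-feature activation with unit gains
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```
</pre>
                </div>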

                <div class="paper-item">
                    <div class="paper-title">LoRA: Low-Rank Adaptation (2021)</div>
                    <p>Hu et al., Microsoft</p>
                    <p>parameter-efficient fine-tuning. freezes base weights and trains low-rank update matrices, often well under 1% of params, with minimal quality loss.</p>
                    <a href="https://arxiv.org/abs/2106.09685" target="_blank">read paper →</a>
                </div>
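                <p>to see where the parameter savings come from, here is a small sketch that counts trainable parameters for a single weight matrix. LoRA keeps the original matrix frozen and trains two low-rank factors instead. the helper name and the 4096 x 4096, rank 8 example are hypothetical.</p>
                <div class="code-block">
                    <pre>
```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full fine-tuning trains W (d_out x d_in). LoRA freezes W and trains
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# hypothetical example: one 4096 x 4096 projection adapted with rank 8
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```
</pre>
                </div>
                <p>this matrix needs about 0.39% of its original parameter count. the whole-model fraction is usually lower because only some matrices get adapters.</p>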

                <h3>training & alignment</h3>
                
                <div class="paper-item">
                    <div class="paper-title">InstructGPT (2022)</div>
                    <p>Ouyang et al., OpenAI</p>
                    <p>RLHF methodology. supervised fine-tuning, then reward modeling + PPO for alignment.</p>
                    <a href="https://arxiv.org/abs/2203.02155" target="_blank">read paper →</a>
                </div>

                <div class="paper-item">
                    <div class="paper-title">Direct Preference Optimization (2023)</div>
                    <p>Rafailov et al., Stanford</p>
                    <p>train on preferences directly without reward model. simpler than RLHF.</p>
                    <a href="https://arxiv.org/abs/2305.18290" target="_blank">read paper →</a>
                </div>
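                <p>a sketch of the DPO objective for a single preference pair, assuming you already have summed sequence log-probabilities from the policy and from a frozen reference model. the function name is hypothetical and <code>beta=0.1</code> is an assumed default. real training code batches this over tensors.</p>
                <div class="code-block">
                    <pre>
```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # how much more the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) in a numerically stable softplus form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# policy identical to the reference: loss equals log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# policy favors the chosen response more than the reference: loss drops
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```
</pre>
                </div>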

                <div class="paper-item">
                    <div class="paper-title">Constitutional AI (2022)</div>
                    <p>Bai et al., Anthropic</p>
                    <p>self-critique and principle-based feedback. scalable oversight.</p>
                    <a href="https://arxiv.org/abs/2212.08073" target="_blank">read paper →</a>
                </div>

                <h3>more essential reads</h3>
                <ul>
                    <li><a href="https://arxiv.org/abs/1810.04805" target="_blank">BERT</a> - masked language modeling, bidirectional pre-training</li>
                    <li><a href="https://arxiv.org/abs/2104.09864" target="_blank">RoFormer (RoPE)</a> - rotary position embeddings</li>
                    <li><a href="https://arxiv.org/abs/1910.02054" target="_blank">ZeRO</a> - memory optimizations for trillion-param models</li>
                    <li><a href="https://arxiv.org/abs/2305.14314" target="_blank">QLoRA</a> - 4-bit quantization + LoRA</li>
                    <li><a href="https://arxiv.org/abs/2201.11903" target="_blank">Chain-of-Thought</a> - step-by-step reasoning improves capabilities</li>
                </ul>
            </div>
        </div>

        <!-- IMPLEMENTATION -->
        <div class="content-section" id="impl">
            <div class="module">
                <h2>[ IMPLEMENTATION GUIDE ]</h2>
                
                <h3>training stack</h3>
                <ul>
                    <li><strong>PyTorch:</strong> primary deep learning framework</li>
                    <li><strong>HuggingFace Transformers:</strong> model implementations</li>
                    <li><strong>HuggingFace Accelerate:</strong> distributed training abstraction</li>
                    <li><strong>DeepSpeed:</strong> optimization and scaling</li>
                    <li><strong>TRL:</strong> RLHF and alignment tools</li>
                    <li><strong>Weights & Biases:</strong> experiment tracking</li>
                </ul>

                <h3>complete training script</h3>
                <div class="code-block">
                    <pre># train.py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "base-model",
    torch_dtype=torch.bfloat16,
    use_cache=False,
)
model.gradient_checkpointing_enable()

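tokenizer = AutoTokenizer.from_pretrained("base-model")
# many causal LM tokenizers ship without a pad token; the collator needs one to pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token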

# load and tokenize dataset
dataset = load_dataset("your-dataset")

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

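# drop the raw columns so only tokenized fields reach the collator
tokenized = dataset.map(
    tokenize,
    batched=True,
    remove_columns=dataset["train"].column_names,
)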

# training arguments
args = TrainingArguments(
    output_dir="./model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=3e-4,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=100,
    save_steps=1000,
)

# train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
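trainer.save_model()
# the tokenizer is not passed to Trainer, so save it next to the weights
tokenizer.save_pretrained(args.output_dir)</pre>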
                </div>
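                <p>the batch settings in the training arguments combine multiplicatively. a quick sketch of the resulting effective batch size, using the values from the script above. the 8-GPU count is an assumption and the helper name is hypothetical.</p>
                <div class="code-block">
                    <pre>
```python
def effective_batch(per_device_batch, grad_accum_steps, num_gpus, max_length):
    """Return (sequences per optimizer step, upper bound on tokens per step)."""
    sequences = per_device_batch * grad_accum_steps * num_gpus
    return sequences, sequences * max_length


# per_device_train_batch_size=4, gradient_accumulation_steps=16, max_length=2048
# assumption: 8 GPUs
sequences, tokens = effective_batch(4, 16, 8, 2048)
print(sequences, tokens)
```
</pre>
                </div>
                <p>that is 512 sequences and at most about 1M tokens per optimizer step. the token count is an upper bound because truncated examples can be shorter than <code>max_length</code>.</p>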

                <h3>instruction tuning</h3>
                <div class="code-block">
                    <pre># instruction tuning with TRL
# reuses `model` and `args` from train.py
# `dataset` must have "instruction" and "output" columns
from trl import SFTTrainer

# format dataset
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

dataset = dataset.map(format_instruction)

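# train
# note: newer TRL releases expect dataset_text_field and the sequence-length
# limit on SFTConfig rather than SFTTrainer; check the docs for your version
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
)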

trainer.train()</pre>
                </div>

                <h3>multi-GPU training</h3>
                <div class="code-block">
                    <pre># launch with accelerate
accelerate config  # run once to configure

accelerate launch train.py

# or use torchrun
torchrun --nproc_per_node=8 train.py

# or DeepSpeed
# train.py as written does not parse CLI flags, so also set
# deepspeed="ds_config.json" in TrainingArguments
deepspeed --num_gpus=8 train.py</pre>
                </div>
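                <p>the DeepSpeed launch refers to a <code>ds_config.json</code> that is not shown. below is one possible minimal ZeRO stage 2 config, generated from python. the stage choice is an assumption and the helper name is hypothetical. the <code>"auto"</code> values rely on the HuggingFace Trainer integration filling them in from <code>TrainingArguments</code>.</p>
                <div class="code-block">
                    <pre>
```python
import json


def build_ds_config():
    """Return a minimal ZeRO stage 2 config for use with the HF Trainer."""
    return {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


# print the JSON; redirect the output to ds_config.json
print(json.dumps(build_ds_config(), indent=2))
```
</pre>
                </div>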

                <div class="tip-box">
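                    <strong>DEBUGGING CHECKLIST:</strong> loss is NaN → reduce LR, enable gradient clipping, check for bad batches | OOM → reduce batch size or sequence length, enable gradient checkpointing | slow training → check GPU utilization and data loading, increase batch size | loss not decreasing → verify data quality, labels, and LR schedule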
                </div>
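                <p>the first and last items of the checklist can be automated with a small monitor. this is a sketch under assumed thresholds. <code>check_loss</code>, <code>window</code>, and <code>tolerance</code> are hypothetical names, and you would call the function from your own logging hook.</p>
                <div class="code-block">
                    <pre>
```python
import math


def check_loss(step, loss, history, window=100, tolerance=1e-3):
    """Return a short diagnosis for the latest loss value, or None if it looks fine."""
    if not math.isfinite(loss):
        return f"step {step}: loss is {loss}; lower the LR or inspect the batch"
    history.append(loss)
    if len(history) >= 2 * window:
        # compare the mean of the latest window with the window before it
        older = sum(history[-2 * window:-window]) / window
        recent = sum(history[-window:]) / window
        if not (older - recent > tolerance):
            return f"step {step}: loss plateaued near {recent:.4f}; check data and labels"
    return None


history = []
print(check_loss(0, float("nan"), history))
```
</pre>
                </div>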
            </div>
        </div>

        <!-- DEPLOYMENT -->
        <div class="content-section" id="deploy">
            <div class="module">
                <h2>[ DEPLOYMENT & PRODUCTION ]</h2>
                
                <h3>inference optimization</h3>
                <ul>
                    <li><strong>quantization:</strong> 8-bit, 4-bit (GPTQ, bitsandbytes)</li>
                    <li><strong>model pruning:</strong> remove redundant weights</li>
                    <li><strong>distillation:</strong> train smaller model to mimic larger</li>
                    <li><strong>speculative decoding:</strong> draft + verify for faster generation</li>
                </ul>
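                <p>to make the quantization bullet concrete, here is a toy sketch of symmetric absmax int8 quantization on a list of floats. this is not the GPTQ or bitsandbytes algorithm, which use more careful per-block and error-compensating schemes. the function names are hypothetical.</p>
                <div class="code-block">
                    <pre>
```python
def quantize_int8(weights):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# hypothetical weights
weights = [0.5, -1.27, 0.0, 0.31]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes, restored)
```
</pre>
                </div>
                <p>each restored value differs from the original by at most half of <code>scale</code>, which is the price paid for storing one byte per weight.</p>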

                <h3>serving frameworks</h3>
                <div class="resource-grid">
                    <div class="resource-card">
                        <h4>vLLM</h4>
                        <p>PagedAttention, continuous batching, high throughput, production-ready</p>
                        <a href="https://github.com/vllm-project/vllm" target="_blank">github →</a>
                    </div>
                    <div class="resource-card">
                        <h4>Text Generation Inference</h4>
                        <p>HuggingFace's official server, easy to use</p>
                        <a href="https://github.com/huggingface/text-generation-inference" target="_blank">github →</a>
                    </div>
                    <div class="resource-card">
                        <h4>Ollama</h4>
                        <p>run LLMs locally with ease, great for development</p>
                        <a href="https://ollama.ai/" target="_blank">website →</a>
                    </div>
                </div>

                <h3>vLLM deployment</h3>
                <div class="code-block">
                    <pre># start vLLM server (shell)
python -m vllm.entrypoints.openai.api_server \
    --model your-model \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 4096

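# use OpenAI-compatible client (python, run in a separate process)
from openai import OpenAI
# the key is a placeholder; vLLM only checks it if the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "hello!"}]
)
print(response.choices[0].message.content)</pre>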
                </div>

                <h3>monitoring</h3>
                <ul>
                    <li>latency (p50, p95, p99 percentiles)</li>
                    <li>throughput (requests/sec, tokens/sec)</li>
                    <li>error rate and failure modes</li>
                    <li>GPU utilization and memory</li>
                    <li>cost per request</li>
                </ul>
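                <p>a small sketch of how the latency percentiles above can be computed from raw request timings, using the nearest-rank method. the helper names and sample data are hypothetical. in production you would typically rely on your metrics system's histograms instead.</p>
                <div class="code-block">
                    <pre>
```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Return the p50, p95, and p99 latencies in milliseconds."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# hypothetical request latencies in milliseconds
latencies_ms = list(range(1, 101))
print(latency_report(latencies_ms))
```
</pre>
                </div>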

                <h3>safety & content filtering</h3>
                <ul>
                    <li>input validation and sanitization</li>
                    <li>output content filtering (toxicity, PII)</li>
                    <li>rate limiting and abuse prevention</li>
                    <li>logging for audit and improvement</li>
                </ul>
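                <p>rate limiting is often implemented with a token bucket. the sketch below is one minimal, single-process version. the class name and parameters are hypothetical, and a real deployment would keep one bucket per user or API key, usually in a shared store.</p>
                <div class="code-block">
                    <pre>
```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Consume one token if available and report whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # refill for the time that has passed, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# hypothetical limit: 5 requests per second with bursts of up to 10
bucket = TokenBucket(rate=5, capacity=10)
print(bucket.allow())
```
</pre>
                </div>
                <p>the injectable <code>clock</code> argument is a design choice that makes the limiter testable without sleeping.</p>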

                <div class="warning-box">
                    <strong>PRODUCTION CHECKLIST:</strong> implement error handling, monitoring, rate limiting, cost tracking, and safety measures BEFORE serving to users. test with production-like loads.
                </div>
            </div>

            <div class="module">
                <h2>[ NEXT STEPS ]</h2>
                
                <h3>start small, scale up</h3>
                <ul>
                    <li>begin with 1-7B parameter models</li>
                    <li>reproduce a paper's results to validate setup</li>
                    <li>fine-tune existing models on your domain</li>
                    <li>contribute to open source projects</li>
                    <li>share findings with the community</li>
                </ul>

                <h3>community resources</h3>
                <ul>
                    <li><a href="https://discuss.huggingface.co/" target="_blank">HuggingFace Forums</a> - active Q&A community</li>
                    <li><a href="https://discord.gg/eleutherai" target="_blank">EleutherAI Discord</a> - open source LLM research</li>
                    <li><a href="https://reddit.com/r/LocalLLaMA" target="_blank">r/LocalLLaMA</a> - local LLM enthusiasts</li>
                </ul>

                <h3>follow research</h3>
                <ul>
                    <li><strong>ArXiv:</strong> <a href="https://arxiv.org/list/cs.CL/recent" target="_blank">cs.CL</a>, <a href="https://arxiv.org/list/cs.LG/recent" target="_blank">cs.LG</a>, <a href="https://arxiv.org/list/cs.AI/recent" target="_blank">cs.AI</a></li>
                    <li><strong>Labs:</strong> Anthropic, OpenAI, Google DeepMind, Meta AI, Mistral AI</li>
                    <li><strong>Open Source:</strong> EleutherAI, HuggingFace, Together AI</li>
                </ul>

                <div class="tip-box">
                    <strong>FINAL ADVICE:</strong> the field moves fast. stay curious, experiment often, don't be afraid to fail. consistent practice beats reading papers alone. build, deploy, iterate. the best way to learn is by doing.
                </div>
            </div>
        </div>

    </div>

    <footer>
        <p>[ BUILD GREAT MODELS ]</p>
        <p style="margin-top: 10px;">keep learning. keep building. keep shipping.</p>
        <p style="margin-top: 20px; font-size: 0.85rem;">
            powered by the open source community | <a href="https://huggingface.co" target="_blank">huggingface.co</a>
        </p>
    </footer>

    <script>
        // tab switching
        const tabs = document.querySelectorAll('.tab');
        const sections = document.querySelectorAll('.content-section');
        
        tabs.forEach(tab => {
            tab.addEventListener('click', () => {
                const targetTab = tab.dataset.tab;
                
                tabs.forEach(t => t.classList.remove('active'));
                sections.forEach(s => s.classList.remove('active'));
                
                tab.classList.add('active');
                document.getElementById(targetTab).classList.add('active');
                
                // smooth scroll to top
                window.scrollTo({ top: 0, behavior: 'smooth' });
            });
        });

        // keyboard navigation
        document.addEventListener('keydown', (e) => {
            if (e.key !== 'ArrowRight' && e.key !== 'ArrowLeft') return;

            // do nothing if no tab is currently marked active
            const activeTab = document.querySelector('.tab.active');
            if (!activeTab) return;

            const target = e.key === 'ArrowRight'
                ? activeTab.nextElementSibling
                : activeTab.previousElementSibling;
            if (target && target.classList.contains('tab')) {
                target.click();
            }
        });

        // select a code block's full contents on click for easy copying
        document.querySelectorAll('.code-block').forEach(block => {
            block.addEventListener('click', () => {
                const selection = window.getSelection();
                const range = document.createRange();
                range.selectNodeContents(block);
                selection.removeAllRanges();
                selection.addRange(range);
            });
        });
    </script>
</body>
</html>