Transformer Architecture: The "Attention Is All You Need" Revolution

In 2017, a paper from Google with the audacious title "Attention is All You Need" changed AI completely. Not because it improved accuracy by a few percent, but because it broke a paradigm that had dominated NLP for 20+ years.

Before the Transformer, state-of-the-art NLP models were built on Recurrent Neural Networks (RNNs), which process text sequentially from left to right. But RNNs have problems:

  • Slow: hard to parallelize (each word must wait for the previous one to be processed)
  • Vanishing gradients: long-range dependencies are hard to learn
  • Limited context: information from the start of a sentence is forgotten by the end

The Transformer solves ALL of these with one mechanism: Self-Attention, which lets the model "look" at the entire input at once and decide for itself which parts matter.

The result? GPT, BERT, T5, and every modern LLM are built on the Transformer architecture.

Self-Attention: "Nhìn" vào Chính Mình

Intuition

Example: the sentence "The animal didn't cross the street because it was too tired."

Question: does "it" refer to "animal" or "street"?

As humans, we know:

  • "tired" → more likely về "animal" (animals get tired, streets don't)
  • "it" → refers to "animal"

Self-Attention lets the model learn this relationship:

"it" attends to:
- "animal": High attention (0.8) ← Strong relation
- "street": Low attention (0.1)
- "tired": Medium attention (0.3)

Self-Attention Mechanism

Goal: for each word, compute a weighted sum over all words in the sequence, weighted by "relevance".

A simple example: the sentence "The cat sat"

Step 1: Create Query, Key, Value vectors

Each word is represented by three vectors:

  • Query (Q): "What I'm looking for"
  • Key (K): "What I offer"
  • Value (V): "What I actually contain"

Word embeddings (assume dim=4):
"The":  [1, 0, 0, 0]
"cat":  [0, 1, 0, 0]
"sat":  [0, 0, 1, 0]

Linear transformations (learned):
Q = Embedding × W_Q
K = Embedding × W_K
V = Embedding × W_V

Step 2: Calculate Attention Scores

For each word, compute a similarity score against every word in the sequence (including itself).

Score(word_i, word_j) = Q_i · K_j  (dot product)

For "cat":
Q_cat · K_The = 0.2
Q_cat · K_cat = 0.9  ← High (attend to itself)
Q_cat · K_sat = 0.6

Step 3: Normalize with Softmax

Attention_weights = softmax(scores / √d_k)

√d_k: scaling factor (keeps the dot products from growing too large, which would push the softmax into regions with near-zero gradients)

After softmax:
"cat" attends to:
- "The": 0.15
- "cat": 0.60  ← Most attention to itself
- "sat": 0.25
Total = 1.0

Step 4: Weighted Sum of Values

Output_cat = 0.15 × V_The + 0.60 × V_cat + 0.25 × V_sat

Result: the representation of "cat" now carries context from the whole sentence!
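
Putting Steps 1-4 together, here is a minimal NumPy sketch of scaled dot-product self-attention for the 3-token example. The projection matrices are random toys here; in a real model W_Q, W_K, W_V are learned.

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # Step 2: pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # Step 3: each row sums to 1
    return weights @ V, weights                      # Step 4: weighted sum of Values

# "The cat sat" as one-hot embeddings (dim=4, as in Step 1)
X = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = self_attention(X, W_Q, W_K, W_V)
print(weights[1])   # how much "cat" attends to "The", "cat", "sat"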

Multi-Head Attention: "Nhiều Góc Nhìn"

Problem: a single attention head can only capture one type of relationship.

Ví dụ: "The cat sat on the mat"

  • Attention head 1: Focus on subject-verb (cat ↔ sat)
  • Attention head 2: Focus on object relationship (sat ↔ mat)
  • Attention head 3: Focus on article-noun (the ↔ cat)

Multi-Head Attention: run multiple attention heads in parallel, then concatenate their outputs.

Input
  ↓
[Head 1] [Head 2] [Head 3] ... [Head 8]
  ↓        ↓        ↓             ↓
Concat → Linear → Output

Example with 8 heads - a runnable NumPy sketch with toy dimensions (d_model=512, 8 heads → d_k=64 per head, as in the original paper):

import numpy as np

n, d_model, num_heads = 3, 512, 8        # sequence length, model dim, number of heads
d_k = d_model // num_heads               # per-head dimension (64)
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))        # toy input embeddings

heads = []
for i in range(num_heads):
    # Each head has separate (learned) Q, K, V projection matrices
    W_Q_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_K_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_V_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    Q_i, K_i, V_i = x @ W_Q_i, x @ W_K_i, x @ W_V_i
    scores_i = Q_i @ K_i.T / np.sqrt(d_k)                                   # (n, n)
    weights_i = np.exp(scores_i) / np.exp(scores_i).sum(-1, keepdims=True)  # softmax
    heads.append(weights_i @ V_i)                                           # attention_i: (n, d_k)

# Concatenate all heads, then apply the output projection W_O
multi_head_output = np.concatenate(heads, axis=-1)                          # (n, d_model)
W_O = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
output = multi_head_output @ W_O

Benefit: the model learns diverse types of relationships simultaneously.

Transformer Architecture: Encoder-Decoder

Full Architecture

Input Sentence                Output Sentence
    ↓                              ↑
[Embedding]                   [Embedding]
    ↓                              ↑
[Positional                   [Positional
 Encoding]                     Encoding]
    ↓                              ↑
┌─────────────┐              ┌─────────────┐
│  ENCODER    │              │  DECODER    │
│  (6 layers) │──────────────│  (6 layers) │
│             │   Attention  │             │
└─────────────┘              └─────────────┘
                                  ↓
                             [Linear + Softmax]
                                  ↓
                            Predicted Tokens

Encoder Stack

Each Encoder layer:

Input
  ↓
[Multi-Head Self-Attention]
  ↓
[Add & Normalize]  ← Residual connection + Layer Norm
  ↓
[Feed-Forward Network]
  ↓
[Add & Normalize]
  ↓
Output

Feed-Forward Network:

FFN(x) = ReLU(x W₁ + b₁) W₂ + b₂

Typically: 
- Input dim: 512
- Hidden dim: 2048 (4x expansion)
- Output dim: 512

Layer Normalization:

LayerNorm(x) = γ × (x - μ) / σ + β

Normalizes across the feature dimension (not the batch) for each sample
→ Stabilizes training

Residual Connections:

Output = LayerNorm(x + MultiHeadAttention(x))
Output = LayerNorm(x + FFN(x))

→ Better gradient flow, making deep networks easier to train
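
A minimal NumPy sketch of the Feed-Forward, LayerNorm, and residual pieces of one encoder layer. The weights are random toys, and the attention output is stubbed out since it was sketched earlier.

import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector (across features, not across the batch)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy parameters (learned in a real model)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

x = rng.normal(size=(3, d_model))                 # token representations entering the layer
attn_out = x                                      # stand-in for MultiHeadAttention(x)
h = layer_norm(x + attn_out, gamma, beta)         # Add & Normalize (residual connection)
out = layer_norm(h + feed_forward(h, W1, b1, W2, b2), gamma, beta)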

Decoder Stack

Each Decoder layer:

Input (shifted right)
  ↓
[Masked Multi-Head Self-Attention]  ← Can't see future tokens
  ↓
[Add & Normalize]
  ↓
[Multi-Head Cross-Attention]  ← Attend to Encoder outputs
  ↓
[Add & Normalize]
  ↓
[Feed-Forward Network]
  ↓
[Add & Normalize]
  ↓
Output

Key differences vs Encoder:

1. Masked Self-Attention:

Input: "I love cats"
Predicting: "<start> I love cats"

When predicting "love":
✓ Can see: "<start>", "I"
✗ Cannot see: "love", "cats"  ← Future tokens masked

Attention mask:
     <start>  I  love  cats
<start>  1    0    0    0
I        1    1    0    0
love     1    1    1    0
cats     1    1    1    1

(1 = can attend, 0 = masked)
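
The mask above can be built directly and applied to the attention scores before the softmax; a minimal NumPy sketch:

import numpy as np

n = 4                                            # sequence length
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # raw scores Q K^T / sqrt(d_k)
mask = np.tril(np.ones((n, n), dtype=bool))      # 1 = can attend, 0 = future token

masked = np.where(mask, scores, -np.inf)         # future positions get -inf ...
weights = np.exp(masked - masked.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # ... so softmax gives them weight 0
print(np.round(weights, 2))                      # lower-triangular attention pattern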

Why mask? To prevent "cheating" during training (the model must not see the answer it is supposed to predict).

2. Cross-Attention:

The Decoder attends to the Encoder outputs.

Queries: from the Decoder
Keys, Values: from the Encoder

→ The Decoder focuses on the relevant parts of the input sentence

Example: translation

Input (Encoder): "I love cats"
Output (Decoder): "J'aime les chats"

Khi predict "chats" (cats):
Cross-attention weights:
- "I": 0.1
- "love": 0.1
- "cats": 0.8  ← High attention!

Positional Encoding: Adding Position Information

Problem: self-attention has no built-in notion of order or position.

"The cat sat" vs "sat cat The" 
→ Same attention scores! (order-invariant)

Solution: add positional information to the embeddings.

Sinusoidal Positional Encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos: position in the sequence (0, 1, 2, ...)
i: Dimension index
d_model: Embedding dimension (512)

Example:

Position 0: [sin(0/10000^0), cos(0/10000^0), sin(0/10000^(2/512)), ...]
Position 1: [sin(1/10000^0), cos(1/10000^0), sin(1/10000^(2/512)), ...]
...

Properties:

  • Deterministic (nothing to learn)
  • Unique for every position
  • Relative positions follow consistent patterns

Alternative: learned positional embeddings (this is what GPT-2 and BERT use).
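
A minimal NumPy sketch of the sinusoidal encoding defined above (the learned alternative would simply be a trainable embedding table indexed by position):

import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # gets added to the word embeddings

pe = positional_encoding(max_len=50)
print(pe.shape)   # (50, 512): one unique vector per position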

Tokenization: Turning Text into Numbers

Problem: Transformers only understand numbers, not text.

Word-level Tokenization (Naive)

Vocabulary: {"I": 1, "love": 2, "cats": 3, "dogs": 4, ...}

Sentence: "I love cats"
Tokens: [1, 2, 3]

Problems:

  • Huge vocabulary: English has ~500k words → a 500k-entry vocab!
  • OOV (Out of Vocabulary): a previously unseen word (e.g. "coronavirus" in 2019) can't be encoded
  • No morphological understanding: "love", "loves", "loved", "loving" → 4 unrelated tokens

Subword Tokenization: The Solution

Idea: break words into subword units.

Example with BPE (Byte Pair Encoding):

Vocabulary: {"I", "love", "cat", "s", "dog", "##s", ...}

"I love cats" → ["I", "love", "cat", "##s"]
"I love dogs" → ["I", "love", "dog", "##s"]

"coronavirus" (OOV) → ["corona", "virus"]  ← Có thể encode!

Benefits:

  • Smaller vocabulary: ~30k tokens instead of 500k words
  • Handles OOV: rare words are broken into known subwords
  • Morphology: "love"/"loves"/"loved" share the "love" subword → semantic similarity is preserved

BPE Algorithm

1. Start with a character-level vocabulary: {a, b, c, ..., z}
2. Count all adjacent symbol pairs in the corpus
3. Merge the most frequent pair into a new vocabulary entry
4. Repeat until the vocabulary reaches the target size (e.g. 30k)

Example:
Initial: "l o w", "l o w e r", "n e w e s t"
Frequency: "l o"=2, "o w"=2, "e r"=1, ...

Merge "l o" → "lo":
"lo w", "lo w e r", "n e w e s t"

Merge "lo w" → "low":
"low", "low e r", "n e w e s t"

Continue...
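
A toy Python sketch of this merge loop (real BPE implementations add end-of-word markers, pre-tokenization, and other details):

from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: learn merges from a list of words; returns (merges, merged corpus)."""
    corpus = Counter(tuple(w) for w in words)          # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):     # count adjacent symbol pairs
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # most frequent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():           # replace the pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    merged.append(symbols[i]); i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "newest"], num_merges=3)
print(merges)   # [('l', 'o'), ('lo', 'w'), ('low', 'e')] (ties broken by first occurrence)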

WordPiece (BERT)

Similar to BPE, but pairs are merged based on likelihood rather than raw frequency.

Vocabulary: {"un", "##expected", "##ly", ...}

"unexpectedly" → ["un", "##expected", "##ly"]

"##" prefix: Indicate subword continuation (không phải word start).

SentencePiece (T5, Llama)

Treats whitespace as just another character → language-agnostic (works with Chinese, Japanese, etc.).

"Hello world" → ["▁Hello", "▁world"]
"你好世界" → ["▁你", "好", "世界"]

(▁ = space character)

Special Tokens

[CLS]: Classification token (start of the sequence - BERT)
[SEP]: Separator (between two sentences)
[PAD]: Padding (makes all sequences in a batch the same length)
[MASK]: Masked token (BERT pre-training)
[UNK]: Unknown token (fallback for OOV)
<BOS>/<EOS>: Begin/End of sequence (GPT)

Scaling Laws: "Bigger is Better"

An important finding about Transformers: performance scales predictably with size.

Kaplan et al. (2020) Scaling Laws

Three factors drive performance:

  1. Model size (N): Number of parameters
  2. Dataset size (D): Number of tokens
  3. Compute (C): FLOPs used for training

Findings:

L(N) ∝ N^(-α),   L(D) ∝ D^(-β),   L(C) ∝ C^(-γ)

α ≈ 0.076, β ≈ 0.095, γ ≈ 0.050
(each power law holds when the other factors are not the bottleneck)

Interpretation: doubling the model size cuts loss by roughly 5% (2^(-0.076) ≈ 0.95).

Optimal allocation:

Given a compute budget C:
- Model size N and dataset size D should be scaled together
- Roughly: D ∝ N^0.74 (data grows sublinearly with model size)

Don't train a huge model on tiny data, or a tiny model on huge data!
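
A quick sanity check of the "double the model" figure, using the exponent quoted above:

alpha_N = 0.076
factor = 2 ** (-alpha_N)                 # loss multiplier when N doubles
print(f"Doubling N multiplies loss by ~{factor:.3f} "
      f"(about a {100 * (1 - factor):.0f}% reduction)")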

Examples:

GPT-2 (2019):   1.5B parameters, 40GB data
GPT-3 (2020):   175B parameters, 570GB data  ← 100x bigger model
GPT-4 (2023):   ~1.7T parameters (rumored)

Performance improves predictably!

Emergent Capabilities

A surprising finding: at certain scale thresholds, models suddenly gain new abilities!

Examples:

  • Few-shot learning: Models <1B parameters struggle, >10B excel
  • Chain-of-thought reasoning: Emerge around 100B parameters
  • Following instructions: Strong at 100B+ parameters

Capability
    │           ╱─────── Emerges!
    │         ╱
    │       ╱
    │     ╱
    └────────────────── Model Size
         Critical
         threshold

Three Flavors of Transformers

Encoder-only (BERT-style)

Input
  ↓
[Encoder Stack]
  ↓
Contextual Embeddings

Use cases:

  • Classification (sentiment, spam detection)
  • Named Entity Recognition
  • Question Answering (extract answer from text)

Training: Masked Language Modeling (MLM)

Input: "The [MASK] sat on the mat"
Task: Predict masked word → "cat"

Examples: BERT, RoBERTa, ALBERT
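
To see MLM in action, here is a minimal sketch using the Hugging Face transformers library (assuming it and a backend such as PyTorch are installed; the model weights are downloaded on first run):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The [MASK] sat on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))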

Decoder-only (GPT-style)

Input (text so far)
  ↓
[Decoder Stack]  ← Causal/masked attention
  ↓
Next Token Prediction

Use cases:

  • Text generation (stories, code, emails)
  • Completion
  • Chat

Training: Causal Language Modeling (CLM)

Input: "The cat sat"
Task: Predict next word → "on"

Input: "The cat sat on"
Task: Predict next word → "the"

Examples: GPT-2, GPT-3, GPT-4, Llama
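
The decoder-only loop - predict a token, append it, predict again - is what generate() wraps. A minimal sketch with GPT-2 via the Hugging Face transformers library (assumed installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0]))   # greedy continuation of "The cat sat"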

Encoder-Decoder (T5-style)

Input
  ↓
[Encoder Stack]
  ↓
[Decoder Stack]
  ↓
Output

Use cases:

  • Translation
  • Summarization
  • Text-to-text tasks

Training: Seq2seq with various objectives

Input: "translate English to French: I love cats"
Output: "J'aime les chats"

Input: "summarize: [long article]"
Output: "Summary: [short summary]"

Examples: T5, BART, mBART
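
The prompt-prefix pattern shown above can be tried with a small pre-trained checkpoint; a minimal sketch using Hugging Face transformers (t5-small also needs the sentencepiece package installed):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: I love cats", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))   # e.g. "J'aime les chats."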

Comparison

Aspect          Encoder-only          Decoder-only          Encoder-Decoder
Architecture    Bidirectional         Unidirectional        Both
Attention       Full                  Causal/Masked         Full (enc) + Causal (dec)
Best for        Understanding         Generation            Seq2seq
Examples        BERT, RoBERTa         GPT-3, Llama          T5, BART
Context         Whole sequence        Past tokens only      Full on input, causal on output

Training Transformers: Challenges

Memory Requirements

Problem: Self-attention has quadratic complexity!

Sequence length: n
Attention matrix: n × n
Memory: O(n²)

For n=1000: 1,000,000 elements
For n=10000: 100,000,000 elements!
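
A back-of-the-envelope check of how quickly this grows, assuming fp16 (2 bytes per element) and ignoring batch size, heads, and layers:

for n in (1_000, 10_000, 100_000):
    megabytes = n * n * 2 / 1e6          # one n x n attention matrix in fp16
    print(f"n = {n:>7,}: {megabytes:>10,.0f} MB")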

Solutions:

  • Sparse attention: Only attend to subset of tokens
  • Linear attention: Approximate attention in O(n) time
  • Flash Attention: Optimized attention implementation

Computational Cost

GPT-3 training:

Cost: ~$4.6 million (estimated)
Time: Several weeks on thousands of GPUs
Energy: ~1,287 MWh (equivalent to 120 US homes for 1 year)
CO₂: ~552 tons

Solutions:

  • Efficient architectures: Distillation, pruning, quantization
  • Better hardware: TPUs, specialized AI chips
  • Distributed training: Split across many machines

Data Requirements

GPT-3: 570GB text (~300 billion tokens)
Chinchilla (70B params): 1.4TB text (~1.4 trillion tokens)

Rule of thumb: Need ~20 tokens per parameter
For 175B model → ~3.5 trillion tokens!

Data quality >> Data quantity - curated data trains better models.

Key Takeaways

  • Self-Attention: Core mechanism - model learns which parts of input are relevant
  • Multi-Head Attention: Multiple "views" of relationships simultaneously
  • Encoder-Decoder: Classic architecture, but modern models simplify (encoder-only or decoder-only)
  • Positional Encoding: Add position information (order matters!)
  • Tokenization: Subword tokenization (BPE, WordPiece, SentencePiece) solves OOV problem
  • Scaling Laws: Performance scales predictably with model/data size
  • Three flavors:
    • Encoder-only (BERT): Understanding
    • Decoder-only (GPT): Generation
    • Encoder-Decoder (T5): Seq2seq
  • Challenges: Quadratic memory, high compute cost, massive data needs

Why Transformers won:

  • Parallelizable: Train much faster than RNNs
  • Long-range dependencies: No vanishing gradients
  • Flexible: Same architecture for vision, audio, multimodal
  • Scalable: Bigger = better (predictably)

Next steps:

  • Implement attention from scratch to understand it deeply
  • Experiment with pre-trained models (HuggingFace)
  • Understand different attention patterns (causal, bidirectional)

In the next post, we'll explore Prompt Engineering - the art of communicating with LLMs to extract maximum value.


This article is part of the series "From Zero to AI Engineer" - Module 6: LLM Architectures