In 2017, a Google paper with the audacious title "Attention is All You Need" completely changed AI. Not because it improved accuracy by a few percent, but because it broke a 20+ year paradigm in NLP.
Before the Transformer, state-of-the-art NLP models were all based on Recurrent Neural Networks (RNNs), which process text sequentially from left to right. But RNNs have problems:
- Sequential processing: each token must wait for the previous one, so training cannot be parallelized
- Long-range dependencies are hard to learn: information from early tokens fades over long sequences (vanishing gradients)
- Training on long sequences is slow
The Transformer solves ALL of these with a single mechanism: Self-Attention, which lets the model "look at" the entire input at once and decide for itself which parts matter.
The result? GPT, BERT, T5, and every modern LLM are built on the Transformer architecture.
Example: the sentence "The animal didn't cross the street because it was too tired."
Question: does "it" refer to "animal" or "street"?
As humans, we know immediately that "it" refers to "animal" - a street cannot be "tired".
Self-Attention lets the model learn this relationship:
"it" attends to:
- "animal": High attention (0.8) ← Strong relation
- "street": Low attention (0.1)
- "tired": Medium attention (0.3)
Goal: for each word, compute a weighted sum of all words in the sequence, weighted by "relevance".
Simple example: the sentence "The cat sat"
Step 1: Create Query, Key, Value vectors
Each word is represented by 3 vectors:
Word Embeddings (assume dim=4):
"The": [1, 0, 0, 0]
"cat": [0, 1, 0, 0]
"sat": [0, 0, 1, 0]
Linear transformations (learned):
Q = Embedding × W_Q
K = Embedding × W_K
V = Embedding × W_V
Step 2: Calculate Attention Scores
For each word, compute its similarity with every word in the sequence (including itself).
Score(word_i, word_j) = Q_i · K_j (dot product)
For "cat":
Q_cat · K_The = 0.2
Q_cat · K_cat = 0.9 ← High (attend to itself)
Q_cat · K_sat = 0.6
Step 3: Normalize với Softmax
Attention_weights = softmax(scores / √d_k)
√d_k: scaling factor (keeps the dot products from growing too large, which would saturate the softmax and make the gradients too small)
After softmax:
"cat" attends to:
- "The": 0.15
- "cat": 0.60 ← Most attention to itself
- "sat": 0.25
Total = 1.0
Step 4: Weighted Sum of Values
Output_cat = 0.15 × V_The + 0.60 × V_cat + 0.25 × V_sat
Result: the representation of "cat" now carries context from the whole sentence!
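To make Steps 1-4 concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention for "The cat sat". The embeddings are the toy one-hot vectors above; the weight matrices are random placeholders standing in for learned parameters, so the exact numbers will differ from the illustration. Later snippets in this article reuse np and this softmax helper.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for "The", "cat", "sat" (dim=4)
X = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

d_k = 4
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(4, d_k)) for _ in range(3))  # learned in a real model

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # Step 1: Query, Key, Value
scores = Q @ K.T                             # Step 2: attention scores
weights = softmax(scores / np.sqrt(d_k))     # Step 3: scale + softmax (rows sum to 1)
output = weights @ V                         # Step 4: weighted sum of Values

print(weights.round(2))   # how much each word attends to every other word
print(output.shape)       # (3, 4): one context-aware vector per word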
Problem: a single attention head captures only one type of relationship.
Example: in "The cat sat on the mat", one head might focus on the subject-verb link ("cat" ↔ "sat") while another tracks the spatial relation ("sat" ↔ "mat").
Multi-Head Attention: run multiple attention heads in parallel, then concatenate the outputs.
Input
↓
[Head 1] [Head 2] [Head 3] ... [Head 8]
↓ ↓ ↓ ↓
Concat → Linear → Output
Example with 8 heads:
# Each head has its own Q, K, V projection matrices: W_Q[i], W_K[i], W_V[i]
# x: input embeddings, shape (seq_len, d_model); np and softmax as in the sketch above
heads = []
for i in range(8):
    Q_i = x @ W_Q[i]
    K_i = x @ W_K[i]
    V_i = x @ W_V[i]
    # Scaled dot-product attention for this head
    heads.append(softmax(Q_i @ K_i.T / np.sqrt(d_k)) @ V_i)

# Concatenate all heads, then project back to the model dimension
multi_head_output = np.concatenate(heads, axis=-1)
output = multi_head_output @ W_O
Benefit: the model learns diverse relationships simultaneously.
Input Sentence Output Sentence
↓ ↑
[Embedding] [Embedding]
↓ ↑
[Positional [Positional
Encoding] Encoding]
↓ ↑
┌─────────────┐ ┌─────────────┐
│ ENCODER │ │ DECODER │
│ (6 layers) │──────────────│ (6 layers) │
│ │ Attention │ │
└─────────────┘ └─────────────┘
↓
[Linear + Softmax]
↓
Predicted Tokens
Each Encoder layer:
Input
↓
[Multi-Head Self-Attention]
↓
[Add & Normalize] ← Residual connection + Layer Norm
↓
[Feed-Forward Network]
↓
[Add & Normalize]
↓
Output
Feed-Forward Network:
FFN(x) = ReLU(x W₁ + b₁) W₂ + b₂
Typically:
- Input dim: 512
- Hidden dim: 2048 (4x expansion)
- Output dim: 512
Layer Normalization:
LayerNorm(x) = γ × (x - μ) / σ + β
Normalize across features (not across the batch), independently for each sample
→ Stabilize training
Residual Connections:
Output = LayerNorm(x + MultiHeadAttention(x))
Output = LayerNorm(x + FFN(x))
→ Better gradient flow, makes deep networks much easier to train
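Putting these pieces together, here is a minimal sketch of one encoder layer in the same NumPy style. It assumes x has shape (seq_len, 512), that self_attention(x) is the multi-head attention from the previous section, and that W1, b1, W2, b2 and the gamma/beta pairs are learned parameters (placeholders here, not real trained weights).

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token across its feature dimension (not across the batch)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: 512 -> 2048 -> ReLU -> 512
    return np.maximum(0., x @ W1 + b1) @ W2 + b2

def encoder_layer(x):
    # Sub-layer 1: multi-head self-attention, then residual connection + LayerNorm
    x = layer_norm(x + self_attention(x), gamma1, beta1)
    # Sub-layer 2: feed-forward network, then residual connection + LayerNorm
    x = layer_norm(x + ffn(x, W1, b1, W2, b2), gamma2, beta2)
    return x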
Each Decoder layer:
Input (shifted right)
↓
[Masked Multi-Head Self-Attention] ← Can't see future tokens
↓
[Add & Normalize]
↓
[Multi-Head Cross-Attention] ← Attend to Encoder outputs
↓
[Add & Normalize]
↓
[Feed-Forward Network]
↓
[Add & Normalize]
↓
Output
Key differences vs Encoder:
1. Masked Self-Attention:
Input: "I love cats"
Predicting: "<start> I love cats"
When predicting "love":
✓ Can see: "<start>", "I"
✗ Cannot see: "love", "cats" ← Future tokens masked
Attention mask:
<start> I love cats
<start> 1 0 0 0
I 1 1 0 0
love 1 1 1 0
cats 1 1 1 1
(1 = can attend, 0 = masked)
Why the mask? To prevent "cheating" during training - the model must not be allowed to see the answer it is supposed to predict.
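Here is a small sketch of how that mask can be built and applied before the softmax (reusing np and the softmax helper from earlier). Masking with a large negative number instead of literal zeros is the usual trick, so that the softmax itself drives the masked weights to ~0.

seq_len = 4  # <start>, I, love, cats
# Lower-triangular matrix: 1 = can attend, 0 = masked future token
mask = np.tril(np.ones((seq_len, seq_len)))

scores = np.random.randn(seq_len, seq_len)         # stand-in for Q @ K.T / sqrt(d_k)
masked_scores = np.where(mask == 1, scores, -1e9)  # future positions get a huge negative score
weights = softmax(masked_scores)                   # each row sums to 1, future weights ≈ 0
print(weights.round(2))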
2. Cross-Attention:
The Decoder attends to the Encoder outputs.
Queries: from the Decoder
Keys, Values: from the Encoder
→ The Decoder focuses on the relevant parts of the input sentence
Example: translation
Input (Encoder): "I love cats"
Output (Decoder): "J'aime les chats"
When predicting "chats" (cats):
Cross-attention weights:
- "I": 0.1
- "love": 0.1
- "cats": 0.8 ← High attention!
Problem: self-attention has no notion of order or position.
"The cat sat" vs "sat cat The"
→ Same attention scores! (order-invariant)
Solution: add positional information to the embeddings.
Sinusoidal Positional Encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
pos: position in the sequence (0, 1, 2, ...)
i: Dimension index
d_model: Embedding dimension (512)
Example:
Position 0: [sin(0/10000^0), cos(0/10000^0), sin(0/10000^(2/512)), ...]
Position 1: [sin(1/10000^0), cos(1/10000^0), sin(1/10000^(2/512)), ...]
...
Properties:
- Each position gets a unique encoding, with values bounded in [-1, 1]
- Relative offsets between positions correspond to simple transformations of the encoding
- Generalizes to sequences longer than any seen during training
Alternative: learned positional embeddings (this is what GPT-2 and BERT use).
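A small NumPy sketch of the sinusoidal encoding above (d_model must be even here):

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000., two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# The model's input is then: token_embeddings + pe[:seq_len]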
Problem: Transformers only understand numbers, not raw text.
Vocabulary: {"I": 1, "love": 2, "cats": 3, "dogs": 4, ...}
Sentence: "I love cats"
Tokens: [1, 2, 3]
Problems:
- The vocabulary explodes (a separate ID for every word form)
- Out-of-vocabulary (OOV) words cannot be encoded at all
- Related forms like "cat" and "cats" get completely unrelated IDs
Idea: break words into subword units.
Example with BPE (Byte Pair Encoding):
Vocabulary: {"I", "love", "cat", "s", "dog", "##s", ...}
"I love cats" → ["I", "love", "cat", "##s"]
"I love dogs" → ["I", "love", "dog", "##s"]
"coronavirus" (OOV) → ["corona", "virus"] ← Có thể encode!
Benefits:
- Fixed, manageable vocabulary size (typically ~30-50k tokens)
- No more OOV: unseen words are decomposed into known subwords
- Related words share subwords, so morphology is partly preserved
1. Start with character-level vocabulary: {a, b, c, ..., z}
2. Count all adjacent character pairs
3. Merge the most frequent pair into the vocabulary
4. Repeat until vocabulary size = target (30k)
Example:
Initial: "l o w", "l o w e r", "n e w e s t"
Frequency: "l o"=2, "o w"=2, "e r"=1, ...
Merge "l o" → "lo":
"lo w", "lo w e r", "n e w e s t"
Merge "lo w" → "low":
"low", "low e r", "n e w e s t"
Continue...
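A toy version of this count-and-merge loop in Python, just to illustrate the idea (not a production tokenizer):

from collections import Counter

# Toy corpus: each word is a tuple of symbols, starting at character level
corpus = Counter({("l", "o", "w"): 5,
                  ("l", "o", "w", "e", "r"): 2,
                  ("n", "e", "w", "e", "s", "t"): 6})

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for step in range(3):                    # a few merges, for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {pair} ->", dict(corpus))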
Similar to BPE, but merges are chosen by likelihood rather than raw frequency.
Vocabulary: {"un", "##expected", "##ly", ...}
"unexpectedly" → ["un", "##expected", "##ly"]
"##" prefix: Indicate subword continuation (không phải word start).
Treats whitespace as a normal character → language-agnostic (works with Chinese, Japanese, etc.).
"Hello world" → ["▁Hello", "▁world"]
"你好世界" → ["▁你", "好", "世界"]
(▁ = space character)
[CLS]: Classification token (start of the sequence - BERT)
[SEP]: Separator (between two sentences)
[PAD]: Padding (make all sequences the same length)
[MASK]: Masked token (BERT pre-training)
[UNK]: Unknown token (fallback for OOV)
<BOS>/<EOS>: Begin/End of sequence (GPT)
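To see subword splits and special tokens in practice, here is a quick check with the Hugging Face transformers library (assuming it is installed; the exact subword split depends on the model's learned vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("unexpectedly"))   # subword pieces; the split depends on the vocabulary
enc = tok("I love cats", "I love dogs")
print(tok.decode(enc.input_ids))      # shows the [CLS] and [SEP] tokens BERT inserts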
An important discovery about Transformers: performance scales predictably with size.
Three factors affect performance:
- N: number of model parameters
- D: dataset size (in tokens)
- C: training compute
Findings:
L(N) ∝ N^(-α), L(D) ∝ D^(-β), L(C) ∝ C^(-γ)
α ≈ 0.076, β ≈ 0.095, γ ≈ 0.050
Interpretation: doubling the model size → roughly a 5% reduction in loss (see the quick check below).
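A quick sanity check of that interpretation, assuming the loss follows the pure power law L ∝ N^(-α):

alpha = 0.076
print(f"{2 ** -alpha:.3f}")   # ≈ 0.949: doubling N multiplies the loss by ~0.95, a ~5% reduction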
Optimal allocation:
Given compute budget C:
- Model size N and Data size D should be scaled together
- Roughly: N ∝ D^0.74
Don't train a huge model on small data, or a small model on huge data!
Examples:
GPT-2 (2019): 1.5B parameters, 40GB data
GPT-3 (2020): 175B parameters, 570GB data ← 100x bigger model
GPT-4 (2023): ~1.7T parameters (rumored)
Performance improves predictably!
Surprising finding: at certain scale thresholds, models suddenly gain new abilities!
Examples (abilities that appear only above a critical size):
- Multi-digit arithmetic
- Few-shot in-context learning
- Multi-step (chain-of-thought) reasoning
Capability
│ ╱─────── Emerges!
│ ╱
│ ╱
│ ╱
└────────────────── Model Size
Critical
threshold
Input
↓
[Encoder Stack]
↓
Contextual Embeddings
Use cases:
- Text classification (sentiment, topic)
- Named entity recognition
- Sentence embeddings / semantic search
- Extractive question answering
Training: Masked Language Modeling (MLM)
Input: "The [MASK] sat on the mat"
Task: Predict masked word → "cat"
Examples: BERT, RoBERTa, ALBERT
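A quick way to try an encoder-only model's MLM head, using the Hugging Face pipeline API (a sketch; requires transformers and a model download):

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The [MASK] sat on the mat."):
    print(pred["token_str"], round(pred["score"], 3))   # top candidates for the masked word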
Input (text so far)
↓
[Decoder Stack] ← Causal/masked attention
↓
Next Token Prediction
Use cases:
- Open-ended text generation
- Chatbots and assistants
- Code generation
- Anything framed as "continue this text"
Training: Causal Language Modeling (CLM)
Input: "The cat sat"
Task: Predict next word → "on"
Input: "The cat sat on"
Task: Predict next word → "the"
Examples: GPT-2, GPT-3, GPT-4, Llama
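And the decoder-only counterpart, again via the pipeline API (a sketch; the generated continuation will vary from run to run):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The cat sat on", max_new_tokens=10)[0]["generated_text"])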
Input
↓
[Encoder Stack]
↓
↓
[Decoder Stack]
↓
Output
Use cases:
- Machine translation
- Summarization
- Other tasks that map one sequence to another sequence
Training: Seq2seq with various objectives
Input: "translate English to French: I love cats"
Output: "J'aime les chats"
Input: "summarize: [long article]"
Output: "Summary: [short summary]"
Examples: T5, BART, mBART
| Aspect | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Architecture | Bidirectional | Unidirectional | Both |
| Attention | Full | Causal/Masked | Full (enc) + Causal (dec) |
| Best for | Understanding | Generation | Seq2seq |
| Examples | BERT, RoBERTa | GPT-3, Llama | T5, BART |
| Context | Can see future | Cannot see future | Flexible |
Problem: Self-attention has quadratic complexity!
Sequence length: n
Attention matrix: n × n
Memory: O(n²)
For n=1000: 1,000,000 elements
For n=10000: 100,000,000 elements!
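A quick estimate of the attention-matrix memory alone (one n×n matrix, stored in fp16, per head and per batch element), just to make the quadratic growth tangible:

for n in (1_000, 10_000, 100_000):
    bytes_fp16 = n * n * 2                       # 2 bytes per element in fp16
    print(f"n={n:>7,}: {bytes_fp16 / 1e6:>12,.0f} MB")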
Solutions (an active research area) include:
- Sparse / sliding-window attention (e.g. Longformer, BigBird)
- Linearized attention approximations (e.g. Performer, Linformer)
- FlashAttention: exact attention computed with far better memory access patterns
GPT-3 training:
Cost: ~$4.6 million (estimated)
Time: Several weeks on thousands of GPUs
Energy: ~1,287 MWh (equivalent to 120 US homes for 1 year)
CO₂: ~552 tons
Solutions include:
- Mixed-precision training and better hardware utilization
- Model, data, and pipeline parallelism across many GPUs
- Reusing pre-trained models (fine-tuning) instead of training from scratch
GPT-3: 570GB text (~300 billion tokens)
Chinchilla (70B params): 1.4TB text (~1.4 trillion tokens)
Rule of thumb: Need ~20 tokens per parameter
For 175B model → ~3.5 trillion tokens!
Data quality >> Data quantity - curated data trains better models.
Why Transformers won:
- Fully parallelizable training - no sequential bottleneck like RNNs
- Self-attention captures long-range dependencies directly
- Performance scales predictably with parameters, data, and compute
- One architecture covers understanding (encoder), generation (decoder), and seq2seq
Next steps:
In the next article, we will explore Prompt Engineering - the art of communicating with LLMs to extract maximum value from them.
This article is part of the series "From Zero to AI Engineer" - Module 6: LLM Architectures.