Transformer Architecture: The "Attention Is All You Need" Revolution

In 2017, a paper from Google with the audacious title "Attention is All You Need" changed AI completely. Not because it improved accuracy by a few percent, but because it broke a paradigm that had dominated NLP for 20+ years.

Before the Transformer, state-of-the-art NLP models were built on Recurrent Neural Networks (RNNs), which process text sequentially from left to right. But RNNs have problems:

  • Slow: hard to parallelize (each word must wait for the previous one to be processed)
  • Vanishing gradients: long-range dependencies are hard to learn
  • Limited context: information from the start of a sentence is forgotten by the end

The Transformer solves ALL of these with one mechanism: Self-Attention, which lets the model "look" at the entire input at once and decide for itself which parts matter.

The result? GPT, BERT, T5, and every modern LLM are built on the Transformer architecture.

Self-Attention: "Nhìn" vào Chính Mình

Intuition

Example: the sentence "The animal didn't cross the street because it was too tired."

Question: does "it" refer to "animal" or "street"?

As humans, we know:

  • "tired" → more likely về "animal" (animals get tired, streets don't)
  • "it" → refers to "animal"

Self-Attention lets the model learn this relationship:

"it" attends to:
- "animal": High attention (0.8) ← Strong relation
- "street": Low attention (0.1)
- "tired": Medium attention (0.3)

Self-Attention Mechanism

Goal: for each word, compute a weighted sum over all words in the sequence, weighted by "relevance".

A simple example: the sentence "The cat sat"

Step 1: Create Query, Key, Value vectors

Each word is represented by three vectors:

  • Query (Q): "What I'm looking for"
  • Key (K): "What I offer"
  • Value (V): "What I actually contain"

Word embeddings (assume dim=4):
"The":  [1, 0, 0, 0]
"cat":  [0, 1, 0, 0]
"sat":  [0, 0, 1, 0]

Linear transformations (learned):
Q = Embedding × W_Q
K = Embedding × W_K
V = Embedding × W_V

Step 2: Calculate Attention Scores

For each word, compute a similarity score against every word in the sequence (including itself).

Score(word_i, word_j) = Q_i · K_j  (dot product)

For "cat":
Q_cat · K_The = 0.2
Q_cat · K_cat = 0.9  ← High (attend to itself)
Q_cat · K_sat = 0.6

Step 3: Normalize with Softmax

Attention_weights = softmax(scores / √d_k)

√d_k: scaling factor (keeps the dot products from growing too large, which would push the softmax into regions with near-zero gradients)

After softmax:
"cat" attends to:
- "The": 0.15
- "cat": 0.60  ← Most attention to itself
- "sat": 0.25
Total = 1.0

Step 4: Weighted Sum of Values

Output_cat = 0.15 × V_The + 0.60 × V_cat + 0.25 × V_sat

Result: the representation of "cat" now carries context from the whole sentence!
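
Putting Steps 1-4 together, here is a minimal NumPy sketch of scaled dot-product self-attention for the 3-token example. The projection matrices are random toys here; in a real model W_Q, W_K, W_V are learned.

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # Step 2: pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # Step 3: each row sums to 1
    return weights @ V, weights                      # Step 4: weighted sum of Values

# "The cat sat" as one-hot embeddings (dim=4, as in Step 1)
X = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = self_attention(X, W_Q, W_K, W_V)
print(weights[1])   # how much "cat" attends to "The", "cat", "sat"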

Multi-Head Attention: "Nhiều Góc Nhìn"

Problem: a single attention head can only capture one type of relationship.

Ví dụ: "The cat sat on the mat"

  • Attention head 1: Focus on subject-verb (cat ↔ sat)
  • Attention head 2: Focus on object relationship (sat ↔ mat)
  • Attention head 3: Focus on article-noun (the ↔ cat)

Multi-Head Attention: run multiple attention heads in parallel, then concatenate their outputs.

Input
  ↓
[Head 1] [Head 2] [Head 3] ... [Head 8]
  ↓        ↓        ↓             ↓
Concat → Linear → Output

Example with 8 heads - a runnable NumPy sketch with toy dimensions (d_model=512, 8 heads → d_k=64 per head, as in the original paper):

import numpy as np

n, d_model, num_heads = 3, 512, 8        # sequence length, model dim, number of heads
d_k = d_model // num_heads               # per-head dimension (64)
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))        # toy input embeddings

heads = []
for i in range(num_heads):
    # Each head has separate (learned) Q, K, V projection matrices
    W_Q_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_K_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    W_V_i = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
    Q_i, K_i, V_i = x @ W_Q_i, x @ W_K_i, x @ W_V_i
    scores_i = Q_i @ K_i.T / np.sqrt(d_k)                                   # (n, n)
    weights_i = np.exp(scores_i) / np.exp(scores_i).sum(-1, keepdims=True)  # softmax
    heads.append(weights_i @ V_i)                                           # attention_i: (n, d_k)

# Concatenate all heads, then apply the output projection W_O
multi_head_output = np.concatenate(heads, axis=-1)                          # (n, d_model)
W_O = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
output = multi_head_output @ W_O

Benefit: the model learns diverse types of relationships simultaneously.

Transformer Architecture: Encoder-Decoder

Full Architecture

Input Sentence                Output Sentence
    ↓                              ↑
[Embedding]                   [Embedding]
    ↓                              ↑
[Positional                   [Positional
 Encoding]                     Encoding]
    ↓                              ↑
┌─────────────┐              ┌─────────────┐
│  ENCODER    │              │  DECODER    │
│  (6 layers) │──────────────│  (6 layers) │
│             │   Attention  │             │
└─────────────┘              └─────────────┘
                                  ↓
                             [Linear + Softmax]
                                  ↓
                            Predicted Tokens

Encoder Stack

Each Encoder layer:

Input
  ↓
[Multi-Head Self-Attention]
  ↓
[Add & Normalize]  ← Residual connection + Layer Norm
  ↓
[Feed-Forward Network]
  ↓
[Add & Normalize]
  ↓
Output

Feed-Forward Network:

FFN(x) = ReLU(x W₁ + b₁) W₂ + b₂

Typically: 
- Input dim: 512
- Hidden dim: 2048 (4x expansion)
- Output dim: 512

Layer Normalization:

LayerNorm(x) = γ × (x - μ) / σ + β

Normalizes across the feature dimension (not the batch) for each sample
→ Stabilizes training

Residual Connections:

Output = LayerNorm(x + MultiHeadAttention(x))
Output = LayerNorm(x + FFN(x))

→ Better gradient flow, making deep networks easier to train
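
A minimal NumPy sketch of the Feed-Forward, LayerNorm, and residual pieces of one encoder layer. The weights are random toys, and the attention output is stubbed out since it was sketched earlier.

import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector (across features, not across the batch)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy parameters (learned in a real model)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

x = rng.normal(size=(3, d_model))                 # token representations entering the layer
attn_out = x                                      # stand-in for MultiHeadAttention(x)
h = layer_norm(x + attn_out, gamma, beta)         # Add & Normalize (residual connection)
out = layer_norm(h + feed_forward(h, W1, b1, W2, b2), gamma, beta)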

Decoder Stack

Each Decoder layer:

Input (shifted right)
  ↓
[Masked Multi-Head Self-Attention]  ← Can't see future tokens
  ↓
[Add & Normalize]
  ↓
[Multi-Head Cross-Attention]  ← Attend to Encoder outputs
  ↓
[Add & Normalize]
  ↓
[Feed-Forward Network]
  ↓
[Add & Normalize]
  ↓
Output

Key differences vs Encoder:

1. Masked Self-Attention:

Input: "I love cats"
Predicting: "<start> I love cats"

When predicting "love":
✓ Can see: "<start>", "I"
✗ Cannot see: "love", "cats"  ← Future tokens masked

Attention mask:
     <start>  I  love  cats
<start>  1    0    0    0
I        1    1    0    0
love     1    1    1    0
cats     1    1    1    1

(1 = can attend, 0 = masked)
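
The mask above can be built directly and applied to the attention scores before the softmax; a minimal NumPy sketch:

import numpy as np

n = 4                                            # sequence length
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # raw scores Q K^T / sqrt(d_k)
mask = np.tril(np.ones((n, n), dtype=bool))      # 1 = can attend, 0 = future token

masked = np.where(mask, scores, -np.inf)         # future positions get -inf ...
weights = np.exp(masked - masked.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # ... so softmax gives them weight 0
print(np.round(weights, 2))                      # lower-triangular attention pattern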

Why mask? To prevent "cheating" during training (the model must not see the answer it is supposed to predict).

2. Cross-Attention:

The Decoder attends to the Encoder outputs.

Queries: from the Decoder
Keys, Values: from the Encoder

→ The Decoder focuses on the relevant parts of the input sentence

Example: translation

Input (Encoder): "I love cats"
Output (Decoder): "J'aime les chats"

Khi predict "chats" (cats):
Cross-attention weights:
- "I": 0.1
- "love": 0.1
- "cats": 0.8  ← High attention!

Positional Encoding: Adding Position Information

Problem: self-attention has no built-in notion of order or position.

"The cat sat" vs "sat cat The" 
→ Same attention scores! (order-invariant)

Solution: add positional information to the embeddings.

Sinusoidal Positional Encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos: position in the sequence (0, 1, 2, ...)
i: Dimension index
d_model: Embedding dimension (512)

Example:

Position 0: [sin(0/10000^0), cos(0/10000^0), sin(0/10000^(2/512)), ...]
Position 1: [sin(1/10000^0), cos(1/10000^0), sin(1/10000^(2/512)), ...]
...

Properties:

  • Deterministic (nothing to learn)
  • Unique for every position
  • Relative positions follow consistent patterns

Alternative: learned positional embeddings (this is what GPT-2 and BERT use).
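
A minimal NumPy sketch of the sinusoidal encoding defined above (the learned alternative would simply be a trainable embedding table indexed by position):

import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # gets added to the word embeddings

pe = positional_encoding(max_len=50)
print(pe.shape)   # (50, 512): one unique vector per position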

Tokenization: Turning Text into Numbers

Problem: Transformers only understand numbers, not text.

Word-level Tokenization (Naive)

Vocabulary: {"I": 1, "love": 2, "cats": 3, "dogs": 4, ...}

Sentence: "I love cats"
Tokens: [1, 2, 3]

Problems:

  • Huge vocabulary: English has ~500k words → a 500k-entry vocab!
  • OOV (Out of Vocabulary): a previously unseen word (e.g. "coronavirus" in 2019) can't be encoded
  • No morphological understanding: "love", "loves", "loved", "loving" → 4 unrelated tokens

Subword Tokenization: The Solution

Idea: break words into subword units.

Example with BPE (Byte Pair Encoding):

Vocabulary: {"I", "love", "cat", "s", "dog", "##s", ...}

"I love cats" → ["I", "love", "cat", "##s"]
"I love dogs" → ["I", "love", "dog", "##s"]

"coronavirus" (OOV) → ["corona", "virus"]  ← Có thể encode!

Benefits:

  • Smaller vocabulary: ~30k tokens instead of 500k words
  • Handles OOV: rare words are broken into known subwords
  • Morphology: "love"/"loves"/"loved" share the "love" subword → semantic similarity is preserved

BPE Algorithm

1. Start with a character-level vocabulary: {a, b, c, ..., z}
2. Count all adjacent symbol pairs in the corpus
3. Merge the most frequent pair into a new vocabulary entry
4. Repeat until the vocabulary reaches the target size (e.g. 30k)

Example:
Initial: "l o w", "l o w e r", "n e w e s t"
Frequency: "l o"=2, "o w"=2, "e r"=1, ...

Merge "l o" → "lo":
"lo w", "lo w e r", "n e w e s t"

Merge "lo w" → "low":
"low", "low e r", "n e w e s t"

Continue...
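
A toy Python sketch of this merge loop (real BPE implementations add end-of-word markers, pre-tokenization, and other details):

from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: learn merges from a list of words; returns (merges, merged corpus)."""
    corpus = Counter(tuple(w) for w in words)          # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):     # count adjacent symbol pairs
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # most frequent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():           # replace the pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    merged.append(symbols[i]); i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "newest"], num_merges=3)
print(merges)   # [('l', 'o'), ('lo', 'w'), ('low', 'e')] (ties broken by first occurrence)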

WordPiece (BERT)

Similar to BPE, but pairs are merged based on likelihood rather than raw frequency.

Vocabulary: {"un", "##expected", "##ly", ...}

"unexpectedly" → ["un", "##expected", "##ly"]

"##" prefix: Indicate subword continuation (không phải word start).

SentencePiece (T5, Llama)

Treats whitespace as just another character → language-agnostic (works with Chinese, Japanese, etc.).

"Hello world" → ["▁Hello", "▁world"]
"你好世界" → ["▁你", "好", "世界"]

(▁ = space character)

Special Tokens

[CLS]: Classification token (start of the sequence - BERT)
[SEP]: Separator (between two sentences)
[PAD]: Padding (makes all sequences in a batch the same length)
[MASK]: Masked token (BERT pre-training)
[UNK]: Unknown token (fallback for OOV)
<BOS>/<EOS>: Begin/End of sequence (GPT)

Scaling Laws: "Bigger is Better"

An important finding about Transformers: performance scales predictably with size.

Kaplan et al. (2020) Scaling Laws

Three factors drive performance:

  1. Model size (N): Number of parameters
  2. Dataset size (D): Number of tokens
  3. Compute (C): FLOPs used for training

Findings:

L(N) ∝ N^(-α),   L(D) ∝ D^(-β),   L(C) ∝ C^(-γ)

α ≈ 0.076, β ≈ 0.095, γ ≈ 0.050
(each power law holds when the other factors are not the bottleneck)

Interpretation: doubling the model size cuts loss by roughly 5% (2^(-0.076) ≈ 0.95).

Optimal allocation:

Given a compute budget C:
- Model size N and dataset size D should be scaled together
- Roughly: D ∝ N^0.74 (data grows sublinearly with model size)

Don't train a huge model on tiny data, or a tiny model on huge data!
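
A quick sanity check of the "double the model" figure, using the exponent quoted above:

alpha_N = 0.076
factor = 2 ** (-alpha_N)                 # loss multiplier when N doubles
print(f"Doubling N multiplies loss by ~{factor:.3f} "
      f"(about a {100 * (1 - factor):.0f}% reduction)")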

Examples:

GPT-2 (2019):   1.5B parameters, 40GB data
GPT-3 (2020):   175B parameters, 570GB data  ← 100x bigger model
GPT-4 (2023):   ~1.7T parameters (rumored)

Performance improves predictably!

Emergent Capabilities

A surprising finding: at certain scale thresholds, models suddenly gain new abilities!

Examples:

  • Few-shot learning: Models <1B parameters struggle, >10B excel
  • Chain-of-thought reasoning: Emerge around 100B parameters
  • Following instructions: Strong at 100B+ parameters

Capability
    │           ╱─────── Emerges!
    │         ╱
    │       ╱
    │     ╱
    └────────────────── Model Size
         Critical
         threshold

Three Flavors of Transformers

Encoder-only (BERT-style)

Input
  ↓
[Encoder Stack]
  ↓
Contextual Embeddings

Use cases:

  • Classification (sentiment, spam detection)
  • Named Entity Recognition
  • Question Answering (extract answer from text)

Training: Masked Language Modeling (MLM)

Input: "The [MASK] sat on the mat"
Task: Predict masked word → "cat"

Examples: BERT, RoBERTa, ALBERT
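
To see MLM in action, here is a minimal sketch using the Hugging Face transformers library (assuming it and a backend such as PyTorch are installed; the model weights are downloaded on first run):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The [MASK] sat on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))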

Decoder-only (GPT-style)

Input (text so far)
  ↓
[Decoder Stack]  ← Causal/masked attention
  ↓
Next Token Prediction

Use cases:

  • Text generation (stories, code, emails)
  • Completion
  • Chat

Training: Causal Language Modeling (CLM)

Input: "The cat sat"
Task: Predict next word → "on"

Input: "The cat sat on"
Task: Predict next word → "the"

Examples: GPT-2, GPT-3, GPT-4, Llama
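
The decoder-only loop - predict a token, append it, predict again - is what generate() wraps. A minimal sketch with GPT-2 via the Hugging Face transformers library (assumed installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0]))   # greedy continuation of "The cat sat"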

Encoder-Decoder (T5-style)

Input
  ↓
[Encoder Stack]
  ↓
[Decoder Stack]
  ↓
Output

Use cases:

  • Translation
  • Summarization
  • Text-to-text tasks

Training: Seq2seq with various objectives

Input: "translate English to French: I love cats"
Output: "J'aime les chats"

Input: "summarize: [long article]"
Output: "Summary: [short summary]"

Examples: T5, BART, mBART
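
The prompt-prefix pattern shown above can be tried with a small pre-trained checkpoint; a minimal sketch using Hugging Face transformers (t5-small also needs the sentencepiece package installed):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: I love cats", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))   # e.g. "J'aime les chats."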

Comparison

Aspect          Encoder-only          Decoder-only          Encoder-Decoder
Architecture    Bidirectional         Unidirectional        Both
Attention       Full                  Causal/Masked         Full (enc) + Causal (dec)
Best for        Understanding         Generation            Seq2seq
Examples        BERT, RoBERTa         GPT-3, Llama          T5, BART
Context         Whole sequence        Past tokens only      Full on input, causal on output

Training Transformers: Challenges

Memory Requirements

Problem: Self-attention has quadratic complexity!

Sequence length: n
Attention matrix: n × n
Memory: O(n²)

For n=1000: 1,000,000 elements
For n=10000: 100,000,000 elements!
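
A back-of-the-envelope check of how quickly this grows, assuming fp16 (2 bytes per element) and ignoring batch size, heads, and layers:

for n in (1_000, 10_000, 100_000):
    megabytes = n * n * 2 / 1e6          # one n x n attention matrix in fp16
    print(f"n = {n:>7,}: {megabytes:>10,.0f} MB")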

Solutions:

  • Sparse attention: Only attend to subset of tokens
  • Linear attention: Approximate attention in O(n) time
  • Flash Attention: Optimized attention implementation

Computational Cost

GPT-3 training:

Cost: ~$4.6 million (estimated)
Time: Several weeks on thousands of GPUs
Energy: ~1,287 MWh (equivalent to 120 US homes for 1 year)
CO₂: ~552 tons

Solutions:

  • Efficient architectures: Distillation, pruning, quantization
  • Better hardware: TPUs, specialized AI chips
  • Distributed training: Split across many machines

Data Requirements

GPT-3: 570GB text (~300 billion tokens)
Chinchilla (70B params): 1.4TB text (~1.4 trillion tokens)

Rule of thumb: Need ~20 tokens per parameter
For 175B model → ~3.5 trillion tokens!

Data quality >> Data quantity - curated data trains better models.

Key Takeaways

  • Self-Attention: Core mechanism - model learns which parts of input are relevant
  • Multi-Head Attention: Multiple "views" of relationships simultaneously
  • Encoder-Decoder: Classic architecture, but modern models simplify (encoder-only or decoder-only)
  • Positional Encoding: Add position information (order matters!)
  • Tokenization: Subword tokenization (BPE, WordPiece, SentencePiece) solves OOV problem
  • Scaling Laws: Performance scales predictably with model/data size
  • Three flavors:
    • Encoder-only (BERT): Understanding
    • Decoder-only (GPT): Generation
    • Encoder-Decoder (T5): Seq2seq
  • Challenges: Quadratic memory, high compute cost, massive data needs

Why Transformers won:

  • Parallelizable: Train much faster than RNNs
  • Long-range dependencies: No vanishing gradients
  • Flexible: Same architecture for vision, audio, multimodal
  • Scalable: Bigger = better (predictably)

Next steps:

  • Implement attention from scratch to understand it deeply
  • Experiment with pre-trained models (HuggingFace)
  • Understand different attention patterns (causal, bidirectional)

In the next post, we'll explore Prompt Engineering - the art of communicating with LLMs to extract maximum value.


This article is part of the series "From Zero to AI Engineer" - Module 6: LLM Architectures