RAG gives LLMs access to external knowledge, but sometimes you need the LLM itself to change its behavior - write in a specific style, understand domain-specific terminology, or follow instructions more reliably. That is where fine-tuning comes in.
In this article, we explore fine-tuning techniques, from full fine-tuning to parameter-efficient methods such as LoRA and QLoRA.
Full fine-tuning updates all of the model's parameters.
# Pseudocode
for epoch in range(num_epochs):
    for batch in training_data:
        # Forward pass
        outputs = model(batch.inputs)
        loss = compute_loss(outputs, batch.labels)

        # Backward pass - gradients for ALL 7B parameters
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
Advantages: maximum quality; every weight can adapt to the new data.
Disadvantages: gradients and optimizer states are kept for every parameter, so GPU memory and cost are far higher than inference; training is slow and catastrophic forgetting is a real risk.
When to use: research, effectively unlimited resources, need for maximum performance.
Parameter-efficient fine-tuning (PEFT) freezes most of the parameters and trains only a small fraction.
# Pseudocode
# Freeze base model
for param in base_model.parameters():
    param.requires_grad = False

# Add small trainable modules (LoRAAdapter is illustrative)
adapter = LoRAAdapter(rank=8)  # Only ~0.1% of parameters

# Train only the adapter
for batch in training_data:
    outputs = model_with_adapter(batch.inputs)
    loss = compute_loss(outputs, batch.labels)
    loss.backward()  # Only adapter gradients
    optimizer.step()
    optimizer.zero_grad()
Advantages: a fraction of the memory and compute, adapters are only a few MB, the frozen base model keeps its original capabilities, and many adapters can share one base model.
Disadvantages: a slightly lower quality ceiling than full fine-tuning on some tasks, plus a small amount of extra computation.
LoRA (Low-Rank Adaptation) is the most popular PEFT technique.
Instead of updating the weight matrix W (dimension d × d), LoRA adds a low-rank decomposition:
Original: W ∈ ℝ^(d×d)
LoRA:     W' = W + BA

- W: frozen original weights (d × d)
- B: trainable (d × r)
- A: trainable (r × d)
- r << d (rank, typically 4-64)

Parameters to train: d×r + r×d = 2dr
Instead of:          d×d

Example: d = 4096, r = 8
- Full: 4096 × 4096 = ~16.8M parameters
- LoRA: 2 × 4096 × 8 = 65,536 parameters (~0.4%)
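To make the decomposition concrete, here is a minimal, illustrative PyTorch sketch of a LoRA-augmented linear layer (the class name `LoRALinear` is made up for illustration; real implementations live in libraries like peft):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: output = base(x) + (alpha / r) * x @ A.T @ B.T"""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():   # freeze W (and bias)
            p.requires_grad_(False)
        d_out, d_in = base_linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: (r × d_in), small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: (d_out × r), zero init so W' = W at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,}")  # 65,536 = 2 × 4096 × 8

In practice you don't write this by hand; the peft library wraps the target modules for you: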
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=8,                                   # Rank
    lora_alpha=32,                         # Scaling factor
    target_modules=["q_proj", "v_proj"],   # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06%
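The 4,194,304 figure matches the math above: with d = 4096 and r = 8, each adapted projection adds 2 × 4096 × 8 = 65,536 parameters, and adapting q_proj and v_proj in each of Llama-2-7B's 32 layers gives 65,536 × 2 × 32 = 4,194,304 trainable parameters.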
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,          # Mixed precision
    logging_steps=10,
    save_steps=100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Train
trainer.train()

# Save the LoRA adapter (only ~10MB instead of 14GB!)
model.save_pretrained("./lora-adapter")
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "./lora-adapter"
)

# Generate
inputs = tokenizer("Explain machine learning:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
r (rank): controls adapter capacity. Higher r means more trainable parameters and more expressive updates; r = 8-16 is a common starting point, 32-64 for harder adaptations.
lora_alpha: scaling factor applied to the LoRA update (the effective scale is lora_alpha / r); a common convention is lora_alpha = 2 × r.
target_modules: which projection matrices receive adapters. Adapting more modules generally improves quality at the cost of more trainable parameters:
# Minimal (fastest)
target_modules=["q_proj", "v_proj"]
# Recommended (balanced)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
# Extensive (best performance)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
QLoRA combines LoRA with quantization to reduce memory even further.
Quantization reduces the precision of the stored weights:
FP32: 32 bits per parameter (4 bytes)
FP16: 16 bits per parameter (2 bytes)   - 50% reduction
INT8:  8 bits per parameter (1 byte)    - 75% reduction
INT4:  4 bits per parameter (0.5 bytes) - 87.5% reduction
QLoRA's key ideas: quantize the frozen base model to 4-bit NormalFloat (NF4), double-quantize the quantization constants, use paged optimizers to absorb memory spikes, and train LoRA adapters in higher precision on top of the quantized base.
7B model:
- FP32: ~28GB
- FP16: ~14GB
- LoRA (FP16): ~14GB base + 100MB adapter
- QLoRA (4-bit): ~3.5GB base + 100MB adapter ✨
Can fine-tune 7B on 1x RTX 3090 (24GB)!
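These base-model figures are just parameter count times bytes per parameter; a quick sanity check in Python (the ~100MB adapter is stored separately):

params = 7e9  # ~7B parameters

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB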
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_use_double_quant=True,       # Double quantization
    bnb_4bit_compute_dtype=torch.float16
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing, casts norm layers for stability)
model = prepare_model_for_kbit_training(model)

# Add LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train with the same Trainer setup as before
trainer.train()
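In practice, QLoRA runs are usually paired with a paged 8-bit optimizer and gradient checkpointing to keep peak memory down; a minimal sketch of the training arguments (the hyperparameters here are illustrative, not tuned):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",      # paged optimizer (bitsandbytes) absorbs memory spikes
    gradient_checkpointing=True,   # recompute activations to save memory
    logging_steps=10,
)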
| Metric | LoRA | QLoRA |
|---|---|---|
| Memory | 14GB (7B) | 3.5GB (7B) |
| Speed | Faster | ~20% slower |
| Quality | Slightly better | Very close (~1-2% diff) |
| Use case | GPUs with enough VRAM | Consumer GPUs |
Adapter layers add small neural network modules between the layers of the model.
# Illustrative pseudocode - classic bottleneck adapters are provided by the
# adapter-transformers / `adapters` library rather than by peft
adapter_config = AdapterConfig(
    reduction_factor=16   # bottleneck dimension = hidden_size / 16
)
model = add_adapters(base_model, adapter_config)
Architecture:
Layer Output
↓
Adapter Module:
↓ Down-project (d → d/16)
↓ Activation
↓ Up-project (d/16 → d)
↓ Skip connection
↓
Next Layer
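In code, a bottleneck adapter is just a tiny two-layer MLP with a residual connection; here is a minimal illustrative PyTorch module (not the actual adapter-transformers implementation):

import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added to the layer output."""
    def __init__(self, hidden_size: int, reduction_factor: int = 16):
        super().__init__()
        bottleneck = hidden_size // reduction_factor
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        # Skip connection keeps the original representation intact
        return hidden_states + self.up(self.act(self.down(hidden_states)))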
Advantages: modular - each task gets its own small adapter that can be added or removed independently.
Disadvantages: unlike LoRA, the extra layers cannot be merged into the base weights, so they add a small amount of inference latency.
During fine-tuning, the model can "forget" its original capabilities (catastrophic forgetting).
# Before fine-tuning on medical data
model("What is photosynthesis?")
# "Photosynthesis is the process plants use to convert sunlight..."
# After fine-tuning
model("What is photosynthesis?")
# "Photosynthesis is... [medical jargon] ...cells... [confused response]"
1. Mix original data:
# Fine-tuning dataset
train_data = domain_specific_data + general_data_sample
# Ratio: 80% domain, 20% general
2. Lower learning rate:
# Too aggressive: 5e-4
# Better: 1e-4 or 2e-5
training_args = TrainingArguments(
learning_rate=2e-5 # Gentle adaptation
)
3. Use PEFT (LoRA): the base weights stay frozen, so the original knowledge is largely preserved.
4. Regularization (see the sketch after this list):
# Elastic Weight Consolidation (EWC)
# Penalize changes to weights that were important for the original task
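To make point 4 concrete, here is a simplified sketch of an EWC-style penalty (the weighting `lam` is illustrative; `old_params` and `fisher` would be snapshots captured before fine-tuning, with the Fisher information typically estimated from squared gradients on the original data):

import torch

def ewc_penalty(model, old_params, fisher, lam=0.1):
    """Quadratic penalty for moving weights that mattered for the original task."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam * penalty

# During training: loss = task_loss + ewc_penalty(model, old_params, fisher)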
Fine-tuning for instruction-following requires proper formatting.
{
"instruction": "Summarize the following article:",
"input": "Long article text here...",
"output": "Summary of the article..."
}
Alpaca format:
def format_alpaca(example):
    if example["input"]:
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example["instruction"]}

### Input:
{example["input"]}

### Response:
{example["output"]}"""
    else:
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example["instruction"]}

### Response:
{example["output"]}"""
    return {"text": prompt}
# ❌ BAD - Low quality
{
"instruction": "tell me about dogs",
"output": "dogs r cool"
}
# ✅ GOOD - High quality
{
"instruction": "Provide a comprehensive overview of dog breeds suitable for apartment living.",
"output": "Several dog breeds adapt well to apartment living due to their size and temperament. Small breeds like Cavalier King Charles Spaniels are gentle and require moderate exercise..."
}
Dataset quality > quantity: a smaller set of carefully written, diverse examples typically beats a much larger noisy one. One way to bootstrap such a dataset is to generate candidates with a stronger model and review them manually:
import json
from openai import OpenAI

client = OpenAI()

def generate_training_examples(topic, num_examples=100):
    examples = []
    for _ in range(num_examples):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"""Generate a training example for fine-tuning an AI assistant specialized in {topic}.

Format:
- Instruction: A question or task
- Input: Optional context (can be empty)
- Output: Detailed, accurate response

Return as JSON."""
            }]
        )
        example = json.loads(response.choices[0].message.content)
        examples.append(example)
    return examples

# Generate data
medical_data = generate_training_examples("medical diagnosis", 500)
Base model (e.g., Llama-2-7b): a raw next-token predictor with no built-in instruction-following; a good starting point for domain adaptation and continued pre-training style tasks.
Instruction-tuned (e.g., Llama-2-7b-chat): already trained to follow instructions and hold conversations; usually the better starting point for assistants and chatbots.
# For a chatbot
base = "meta-llama/Llama-2-7b-chat-hf"  # Recommended
# For domain modeling
base = "meta-llama/Llama-2-7b-hf"
from transformers import TrainerCallback

class OverfitCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics, **kwargs):
        # Assumes the most recent training log entry before evaluation contains "loss"
        train_loss = state.log_history[-2]["loss"]
        eval_loss = metrics["eval_loss"]
        if eval_loss > train_loss * 1.5:
            print("⚠️ Possible overfitting!")
        return control

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[OverfitCallback()]
)
# Don't just look at the loss
# Test on held-out examples
test_prompts = [
    "Explain quantum computing:",
    "Write a haiku about spring:",
    "Debug this Python code: ..."
]

for prompt in test_prompts:
    output = generate(model, prompt)  # generate() is a small helper, sketched below
    print(f"Prompt: {prompt}")
    print(f"Output: {output}\n")
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Warmup: gradually increase the LR
# Decay: gradually decrease the LR
optimizer = AdamW(model.parameters(), lr=2e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=1000
)
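If you train with the Hugging Face `Trainer`, you normally don't build the scheduler by hand; `TrainingArguments` exposes the same warmup/decay behaviour (the values shown are illustrative):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    learning_rate=2e-4,
    warmup_steps=100,             # or warmup_ratio=0.03
    lr_scheduler_type="linear",   # linear decay after warmup (the default)
    num_train_epochs=3,
)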
Gradient accumulation lets you train with a larger effective batch size on limited GPU memory:
training_args = TrainingArguments(
    per_device_train_batch_size=1,    # Physical batch size
    gradient_accumulation_steps=16,   # Accumulate gradients over 16 batches
    # Effective batch size = 1 × 16 = 16
)
Combine base model + adapter into single model:
from peft import PeftModel
# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Merge
merged_model = lora_model.merge_and_unload()
# Save as standard model
merged_model.save_pretrained("./merged-model")
When to merge: for production deployment of a single fine-tuned variant - inference is marginally faster and you no longer need the peft library at serving time.
When NOT to merge: when you plan to keep training the adapter, or want to serve several adapters on top of one shared base model (see the sketch below).
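One reason not to merge is that a single base model can hot-swap between multiple adapters; with peft this looks roughly like the following (the adapter paths and names are placeholders):

from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "./lora-medical", adapter_name="medical")
model.load_adapter("./lora-legal", adapter_name="legal")

model.set_adapter("medical")   # route requests through the medical adapter
# ... generate ...
model.set_adapter("legal")     # switch tasks without reloading the 7B base weights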
In the next article, we will explore AI Safety & Alignment - hallucination, RLHF, and evaluation frameworks for LLMs.
This article is part of the "From Zero to AI Engineer" series - Module 8: Advanced GenAI Topics