In the previous post, we learned about vector search and embeddings, the foundation of RAG. But simply embedding whole documents and searching over them usually gives poor results. In this post, we'll explore advanced techniques for improving retrieval quality, the factor that makes or breaks a RAG system.
A good RAG system needs to answer: "How do we retrieve the right context, at the right granularity, for the LLM?"
Chunking is the process of splitting long documents into smaller pieces for embedding and search.
The problem with embedding a whole document:
# A 10,000-word document about "Machine Learning"
# User query: "What is gradient descent?"
# If we embed the entire document:
# - The vector mixes information about every topic in the doc
# - Diluted semantics: "gradient descent" makes up only ~2% of the content
# - The LLM receives 10,000 words when the answer only needs 200
Benefits of chunking:
Fixed-size chunking: split the document into chunks of a fixed number of tokens/characters.
def fixed_size_chunk(text, chunk_size=512, overlap=50):
    """
    Chunk text with fixed size and overlap.
    Args:
        chunk_size: Number of characters per chunk
        overlap: Characters of overlap between chunks
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap  # advance by chunk_size - overlap so consecutive chunks share context
    return chunks
# Example
text = "Long document..." * 1000
chunks = fixed_size_chunk(text, chunk_size=1000, overlap=100)
What is overlap, and why do we need it?
Chunk 1: [...........................]
Chunk 2:               [...........................]
                       |-- overlap --|

# Overlap helps:
# - Avoid cutting sentences/paragraphs mid-way
# - Maintain context continuity between chunks
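To see the overlap concretely, here is a quick demo of fixed_size_chunk on a toy string (the text and sizes below are illustrative only):

text = "The quick brown fox jumps over the lazy dog. " * 4
demo_chunks = fixed_size_chunk(text, chunk_size=60, overlap=15)

for i, chunk in enumerate(demo_chunks):
    print(i, repr(chunk))
# The last 15 characters of each chunk reappear at the start of the next one,
# so text near a chunk boundary is visible in both neighboring chunks.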
Pros:
Cons:
When to use: simple use cases, homogeneous documents
Sentence-based chunking: split along natural boundaries (sentences).
import nltk
nltk.download('punkt')

def sentence_chunk(text, max_sentences=5):
    """Chunk by sentences."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = []
    for sent in sentences:
        current_chunk.append(sent)
        if len(current_chunk) >= max_sentences:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
    # Add remaining sentences
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
Pros:
Cons:
Semantic chunking: split based on semantic similarity; start a new chunk when the topic changes.
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(text, similarity_threshold=0.5):
    """
    Chunk based on semantic similarity between sentences.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)

    # Embed all sentences (normalized so the dot product below is cosine similarity)
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Calculate similarity with the previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1])
        if similarity > similarity_threshold:
            # Similar topic - add to the current chunk
            current_chunk.append(sentences[i])
        else:
            # Topic changed - start a new chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]

    # Add the last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
Pros:
Cons:
When to use: high-quality requirements, documents with varied topics
Recursive chunking: split hierarchically, respecting the document's structure.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_chunk(text, chunk_size=1000, chunk_overlap=200):
    """
    Recursively split by different separators.
    Priority: \n\n (paragraphs) → \n (lines) → . (sentences) → space
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_text(text)
    return chunks
# Example
text = """
# Chapter 1: Introduction
Machine learning is a subset of AI.
It enables computers to learn from data.
# Chapter 2: Algorithms
Common algorithms include:
- Linear Regression
- Decision Trees
"""
chunks = recursive_chunk(text)
# Respects headings, paragraphs, lists
Pros:
Cons:
When to use: structured documents (markdown, code, articles)
Code chunking:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

# Respects function/class boundaries
code = """
def function1():
    pass

class MyClass:
    def method1(self):
        pass
"""
chunks = python_splitter.split_text(code)
Markdown chunking:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
chunks = splitter.split_text(markdown_text)
# Each chunk includes header hierarchy as metadata
Trade-offs:
Small chunks (256 tokens):
✅ Precise retrieval
✅ Less noise
❌ May lose context
❌ More chunks = slower search
Large chunks (2048 tokens):
✅ More context
✅ Fewer chunks
❌ Diluted semantics
❌ Expensive LLM processing
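Note that the sizes above are in tokens, while fixed_size_chunk measures characters. A minimal sketch, assuming the tiktoken package is installed, for checking how many tokens your character-based chunks actually contain (reusing text and fixed_size_chunk from the earlier example):

import tiktoken

# cl100k_base is the encoding used by many recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

def token_length(chunk):
    return len(enc.encode(chunk))

chunks = fixed_size_chunk(text, chunk_size=1000, overlap=100)
print([token_length(c) for c in chunks])  # English text averages roughly 4 characters per token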
Recommended sizes:
How to find optimal size:
# Experiment with different sizes
chunk_sizes = [256, 512, 1024, 2048]
results = {}

for size in chunk_sizes:
    chunks = create_chunks(documents, chunk_size=size)
    # Evaluate retrieval quality
    precision = evaluate_retrieval(chunks, test_queries)
    results[size] = precision

# Plot the results and choose the best size
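create_chunks and evaluate_retrieval above are placeholders. A minimal sketch of evaluate_retrieval as a hit-rate metric, assuming test_queries is a list of (query, expected_substring) pairs and reusing all-MiniLM-L6-v2 for retrieval:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def evaluate_retrieval(chunks, test_queries, top_k=5):
    """Hit rate: fraction of queries whose expected text appears in the top_k retrieved chunks."""
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for query, expected in test_queries:
        query_emb = model.encode(query, normalize_embeddings=True)
        scores = chunk_embs @ query_emb  # cosine similarity (embeddings are normalized)
        top_idx = np.argsort(scores)[-top_k:][::-1]
        if any(expected in chunks[i] for i in top_idx):
            hits += 1
    return hits / len(test_queries)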
Pure vector search can miss exact keyword matches. Hybrid search combines keyword and vector search.
BM25 is an evolution of TF-IDF that ranks documents by keyword relevance.
from rank_bm25 import BM25Okapi
# Corpus
documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "Python is a programming language"
]
# Tokenize
tokenized_docs = [doc.lower().split() for doc in documents]
# Build BM25 index
bm25 = BM25Okapi(tokenized_docs)
# Query
query = "machine learning AI"
tokenized_query = query.lower().split()
# Get scores
scores = bm25.get_scores(tokenized_query)
print(scores) # [high, low, low]
BM25 strengths:
BM25 weaknesses:
Method 1: Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    Combine rankings from multiple sources.
    Score for doc_id = Σ(1 / (k + rank_i))
    """
    scores = {}

    # Add BM25 contributions
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Add vector-search contributions
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]
# Example
bm25_results = [1, 3, 5] # doc IDs ranked by BM25
vector_results = [2, 1, 4] # doc IDs ranked by vector search
combined = reciprocal_rank_fusion(bm25_results, vector_results)
# [1, 2, 3, ...] # doc 1 appears in both → higher score
Method 2: Weighted Fusion
def weighted_fusion(bm25_scores, vector_scores, alpha=0.5):
    """
    Combine scores with weights.
    Final score = alpha * bm25_score + (1 - alpha) * vector_score
    """
    def normalize(scores):
        # Min-max normalize scores to [0, 1] before combining
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

    bm25_scores = normalize(bm25_scores)
    vector_scores = normalize(vector_scores)

    combined = {}
    all_doc_ids = set(bm25_scores.keys()) | set(vector_scores.keys())
    for doc_id in all_doc_ids:
        bm25_score = bm25_scores.get(doc_id, 0)
        vector_score = vector_scores.get(doc_id, 0)
        combined[doc_id] = alpha * bm25_score + (1 - alpha) * vector_score
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
Tuning alpha:
Experiment to find optimal:
for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
    results = weighted_fusion(bm25_scores, vector_scores, alpha)
    precision = evaluate(results, ground_truth)
    print(f"Alpha {alpha}: Precision {precision}")
Weaviate example:
result = client.query.get("Document", ["text"]) \
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # 0.5 = balanced keyword/vector
    ) \
    .with_limit(10) \
    .do()
Qdrant example:
from qdrant_client.models import Prefetch, FusionQuery, Fusion

result = client.query_points(
    collection_name="documents",
    prefetch=[
        Prefetch(query="machine learning", using="keywords", limit=20),
        Prefetch(query=[0.1, 0.2, ...], using="vectors", limit=20)
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10
)
Retrieval returns top-k (e.g., 100) candidates. Re-ranking refines to top-n (e.g., 5) best.
Two-stage retrieval:
Stage 1: Fast retrieval (Bi-encoder)
├─ Query embedding: Single forward pass
├─ Compare with millions of docs: Fast (dot product)
└─ Top-100 candidates
Stage 2: Precise re-ranking (Cross-encoder)
├─ Query + Doc together: Attention across both
├─ More accurate but slower
└─ Top-5 results
Bi-encoder (Retrieval):
# Encode separately
query_emb = model.encode("What is Python?")
doc_emb = model.encode("Python is a programming language")
# Compare
similarity = cosine_similarity(query_emb, doc_emb)
Pros: Fast - can pre-compute doc embeddings
Cons: No interaction between query and doc
Cross-encoder (Re-ranking):
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query + doc pairs together
score = model.predict([
    ("What is Python?", "Python is a programming language"),
    ("What is Python?", "JavaScript is also a language")
])
# score[0] > score[1]
Pros: More accurate - full attention
Cons: Slow - can't pre-compute
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Models
retriever = SentenceTransformer('all-MiniLM-L6-v2')
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, documents, top_k=5):
    # Stage 1: Fast retrieval (top-100)
    query_emb = retriever.encode(query)
    doc_embs = retriever.encode(documents)
    similarities = util.cos_sim(query_emb, doc_embs)[0].numpy()

    # Get top-100 candidates
    top_100_indices = np.argsort(similarities)[-100:][::-1]
    candidates = [documents[i] for i in top_100_indices]

    # Stage 2: Precise re-ranking with the cross-encoder
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = reranker.predict(pairs)

    # Keep only the top-k after re-ranking
    top_k_indices = np.argsort(rerank_scores)[-top_k:][::-1]
    final_results = [candidates[i] for i in top_k_indices]
    return final_results
Performance impact:
Retrieval only: 100ms, 70% accuracy
Retrieval + Re-ranking: 150ms, 85% accuracy
Trade-off: +50ms for +15% accuracy → Worth it!
import cohere

co = cohere.Client("your-key")

# Retrieved documents
documents = [
    "Document 1 text...",
    "Document 2 text...",
    # ... 100 documents
]

# Re-rank
results = co.rerank(
    query="What is machine learning?",
    documents=documents,
    top_n=5,
    model="rerank-english-v2.0"
)

# Get the top 5 re-ranked documents
for idx, result in enumerate(results.results):
    print(f"{idx}: {documents[result.index]} (score: {result.relevance_score})")
User queries are often not directly suitable for retrieval.
Query expansion adds related terms to broaden coverage.
from transformers import pipeline

# Use an LLM to expand the query
def expand_query(query):
    generator = pipeline('text-generation', model='gpt2')
    prompt = f"""Expand this search query with related terms:
Query: {query}
Expanded:"""
    result = generator(prompt, max_new_tokens=30)  # cap new tokens; max_length would also count the prompt
    return result[0]['generated_text']

# Example
original = "machine learning"
expanded = expand_query(original)
# "machine learning artificial intelligence neural networks deep learning"
Classic expansion with synonyms:
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')

def expand_with_synonyms(query):
    words = query.split()
    expanded = []
    for word in words:
        synonyms = wordnet.synsets(word)
        for syn in synonyms[:2]:  # Top 2 synsets
            expanded.extend(syn.lemma_names())
    return ' '.join(set(words + expanded))

# "car" → "car automobile vehicle motorcar"
Query decomposition splits a complex query into sub-queries.
def decompose_query(complex_query):
    """
    Example:
    "Compare Python and JavaScript for web development"
    →
    [
        "Python for web development",
        "JavaScript for web development",
        "Python vs JavaScript comparison"
    ]
    """
    # Use an LLM to decompose the query
    prompt = f"""Break down this complex query into simpler sub-queries:
Query: {complex_query}
Sub-queries:"""
    # Call your LLM client (llm is a placeholder)
    sub_queries = llm.generate(prompt)
    return sub_queries

# Retrieve for each sub-query and combine the results (see the sketch below)
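A minimal sketch of that combine step, assuming a retrieve(query) function that returns a ranked list of doc IDs, and fusing the per-sub-query rankings with the same RRF idea as above:

def retrieve_for_subqueries(sub_queries, retrieve, top_k=5, k=60):
    """Retrieve for each sub-query, then fuse the ranked lists with RRF."""
    scores = {}
    for sub_query in sub_queries:
        for rank, doc_id in enumerate(retrieve(sub_query)):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]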
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, and use that embedding for search.
def hyde_search(query, vector_db):
    # Step 1: Generate a hypothetical document (llm is a placeholder for your LLM client)
    prompt = f"""Write a detailed answer to this question:
Question: {query}
Answer:"""
    hypothetical_doc = llm.generate(prompt)

    # Step 2: Embed the hypothetical doc
    hyp_embedding = embed_model.encode(hypothetical_doc)

    # Step 3: Search with the hypothetical embedding
    results = vector_db.search(hyp_embedding, top_k=5)
    return results

# Why this works:
# - Query: "What is Python?" → short, generic embedding
# - Hypothetical answer: "Python is a high-level..." → rich, detailed embedding
# - Better matches against actual documents
In the next post, we'll explore LLM orchestration frameworks: LangChain, LlamaIndex, and how to build complex RAG applications.
This post is part of the series "From Zero to AI Engineer" - Module 7: RAG (Retrieval-Augmented Generation)