When you search for "dog" on Google, it also understands "puppy", "canine", and "pet". When you ask a chatbot "How do I reset my password?", it matches a document titled "Password Recovery Guide" even though the word "reset" never appears. Behind all of this are vector embeddings and semantic search.
RAG (Retrieval-Augmented Generation) - the technique that lets LLMs access a knowledge base - is built entirely on vector search. A solid understanding of embeddings and vector search is the first step toward building effective RAG systems.
An embedding represents text (or images, or audio) as a vector (an array of numbers) in a high-dimensional space, in a way that preserves semantic meaning.
"The cat sat on the mat"
↓ Embedding Model
[0.234, -0.567, 0.891, ..., 0.123] # 768-dimensional vector (example)
Why do we need embeddings?
Texts with similar meanings map to vectors that lie close together in this space.
# Example with sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"The cat sleeps on the couch",
"A feline rests on the sofa", # Semantic similar
"Python is a programming language" # Semantic different
]
embeddings = model.encode(sentences)
# embeddings[0] and embeddings[1] are very close to each other
# embeddings[0] and embeddings[2] are far apart
Sparse vectors: most values are 0 (example: Bag of Words)
# Vocabulary: ["cat", "dog", "fish", "bird"]
"cat and dog" → [1, 1, 0, 0] # Chỉ có 2 non-zero values
Dense vectors: most values are non-zero (embeddings are usually dense)
"cat and dog" → [0.234, -0.567, 0.891, 0.123, ...] # every dimension carries a value
Trade-offs: sparse vectors are interpretable and great for exact keyword matching, while dense vectors capture semantic similarity between different words but take more memory and are harder to interpret.
Word embeddings: one fixed vector for each individual word.
# Word2Vec example (conceptual)
"king" → [0.2, 0.5, 0.8, ...]
"queen" → [0.25, 0.48, 0.75, ...]
"man" → [-0.1, 0.3, 0.6, ...]
"woman" → [-0.08, 0.28, 0.55, ...]
# Famous example: king - man + woman ≈ queen
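You can try this analogy yourself. A minimal sketch, assuming gensim and its pretrained-vector downloader are available (the GloVe model name is one of gensim's built-in datasets):
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")  # small pretrained GloVe vectors
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# "queen" should come out at or near the top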
Limitations: every occurrence of a word gets the same vector, so a polysemous word like "bank" cannot be disambiguated by its context.
Contextual embeddings: the vector for a word depends on its surrounding context.
# "bank" gets different vectors depending on the context
sentence1 = "I deposited money at the bank"
sentence2 = "We sat by the river bank"
# BERT produces different vectors for "bank" in these two sentences
How it works: a model like BERT runs self-attention over the entire sentence, so each token's vector is shaped by all the tokens around it.
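A sketch of that effect, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
def bank_vector(sentence):
    # Return the contextual vector of the token "bank" in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]
v_money = bank_vector("I deposited money at the bank")
v_river = bank_vector("We sat by the river bank")
print(torch.cosine_similarity(v_money, v_river, dim=0))  # noticeably below 1.0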
Sentence embeddings: models optimized to produce good sentence-level embeddings.
from sentence_transformers import SentenceTransformer
# Popular models
model = SentenceTransformer('all-MiniLM-L6-v2') # Small, fast
# model = SentenceTransformer('all-mpnet-base-v2') # Better quality
# model = SentenceTransformer('all-MiniLM-L12-v2') # Balanced
text = "Artificial intelligence is transforming healthcare"
embedding = model.encode(text)
print(embedding.shape) # (384,) - a 384-dimensional vector
Sentence-BERT architecture:
Input Sentence
↓
BERT Encoder
↓
Pooling (mean/max/CLS)
↓
Dense Vector (sentence embedding)
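A rough sketch of the pooling step, assuming the Hugging Face transformers library is available (this mirrors what sentence-transformers does internally for this model):
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("Artificial intelligence is transforming healthcare", return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, 384)
# Mean pooling: average the token vectors, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 384])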
OpenAI Embeddings:
from openai import OpenAI
client = OpenAI(api_key="your-key")
response = client.embeddings.create(
model="text-embedding-3-small", # 1536 dimensions
input="Your text here"
)
embedding = response.data[0].embedding
Models:
text-embedding-3-small: 1536D, fast, cheap
text-embedding-3-large: 3072D, best quality
text-embedding-ada-002: legacy, still good
Cohere Embeddings:
import cohere
co = cohere.Client("your-key")
response = co.embed(
texts=["Text 1", "Text 2"],
model="embed-english-v3.0",
input_type="search_document" # or "search_query"
)
embeddings = response.embeddings
The input types matter:
search_document: for documents being indexed
search_query: for user queries
How do we know whether two vectors are "close" to each other?
Cosine similarity measures the angle between two vectors.
cosine_similarity = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
1: exactly the same direction
0: unrelated (orthogonal)
-1: completely opposite
Implementation:
import numpy as np
def cosine_similarity(v1, v2):
dot_product = np.dot(v1, v2)
norm_v1 = np.linalg.norm(v1)
norm_v2 = np.linalg.norm(v2)
return dot_product / (norm_v1 * norm_v2)
# Example
vec1 = np.array([1, 2, 3])
vec2 = np.array([2, 4, 6]) # Same direction, 2x magnitude
similarity = cosine_similarity(vec1, vec2)
print(similarity) # 1.0 (perfect match)
Advantages: ignores vector magnitude, so only direction (meaning) is compared; the standard choice for text embeddings.
When to use: text search, document similarity
Euclidean distance measures the straight-line distance between two points.
euclidean_distance = √(Σ(ai - bi)²)
Range: [0, ∞]
0: Identical
Larger: More different
Implementation:
def euclidean_distance(v1, v2):
return np.sqrt(np.sum((v1 - v2) ** 2))
# Or use scipy
from scipy.spatial.distance import euclidean
distance = euclidean(vec1, vec2)
When to use: image embeddings, spatial data
Dot product: the simplest similarity measure - multiply the vectors element-wise and sum the results.
dot_product = Σ(ai × bi)
Range: [-∞, ∞]
Implementation:
similarity = np.dot(v1, v2)
Note: if the vectors are already normalized (||v|| = 1), the dot product equals cosine similarity!
# Normalize vectors
v1_norm = v1 / np.linalg.norm(v1)
v2_norm = v2 / np.linalg.norm(v2)
# Now dot product = cosine similarity
np.isclose(np.dot(v1_norm, v2_norm), cosine_similarity(v1, v2)) # True (up to floating-point error)
Advantage: the fastest to compute
When to use: when vectors are already normalized and speed matters
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
query = "How to train a machine learning model?"
documents = [
"Machine learning model training guide",
"Cooking pasta recipes",
"Deep learning tutorial for beginners"
]
# Encode
query_emb = model.encode(query)
doc_embs = model.encode(documents)
# Calculate similarities
for i, doc in enumerate(documents):
cos_sim = util.cos_sim(query_emb, doc_embs[i])
print(f"{doc}: {cos_sim.item():.4f}")
# Output:
# Machine learning model training guide: 0.7234
# Cooking pasta recipes: 0.1256
# Deep learning tutorial for beginners: 0.6891
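For larger document sets, sentence-transformers has a helper that does the top-k loop for you. A short sketch reusing the embeddings from above:
hits = util.semantic_search(query_emb, doc_embs, top_k=2)  # list of top-k hits per query
for hit in hits[0]:
    print(documents[hit["corpus_id"]], round(hit["score"], 4))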
With millions of vectors, how do we find the k nearest ones quickly?
k-NN (brute force): compare the query against every vector.
def knn_search(query, vectors, k=5):
similarities = []
for vec in vectors:
sim = cosine_similarity(query, vec)
similarities.append(sim)
# Get top-k indices
top_k_indices = np.argsort(similarities)[-k:][::-1]
return top_k_indices
# Complexity: O(n) - linear scan
Advantage: 100% accurate (exact search)
Disadvantage: far too slow for large datasets (millions of vectors)
ANN (Approximate Nearest Neighbor) search: trade a little accuracy for a huge gain in speed.
Examples:
1. HNSW (Hierarchical Navigable Small World)
HNSW builds a graph structure with multiple layers.
Layer 2: [Few nodes, long range connections]
Layer 1: [More nodes, medium range]
Layer 0: [All nodes, short range]
How it works: start at the sparse top layer, greedily hop to the neighbor closest to the query, drop down a layer, and repeat until reaching layer 0, where a local search returns the nearest candidates (see the hnswlib sketch below).
Advantages: very fast queries with high recall, even at large scale; no separate training step.
Disadvantages: high memory usage (the graph stores many links per vector) and relatively slow index construction.
Used by: FAISS, Hnswlib, Milvus, Weaviate
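A minimal hnswlib sketch (the parameter values here are illustrative, not tuned):
import hnswlib
import numpy as np
dim, n = 384, 10000
data = np.random.random((n, dim)).astype("float32")
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # M = links per node
index.add_items(data, np.arange(n))
index.set_ef(50)  # higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)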
2. IVF (Inverted File Index)
IVF partitions the vector space into clusters; a FAISS sketch follows below.
1. Cluster the vectors into N groups (e.g., 1000 clusters)
2. For a query, find the M nearest clusters (e.g., 10 clusters)
3. Search only within those M clusters
Advantages: fast to build, memory-friendly, and easy to tune the speed/accuracy trade-off (how many clusters to probe).
Disadvantages: requires a training (clustering) step, and recall drops for queries that land near cluster boundaries.
Used by: FAISS (IVF-Flat, IVF-PQ)
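A minimal FAISS sketch of those three steps (the cluster and probe counts are illustrative and kept small for a toy dataset; in practice `vectors` would be your real embeddings):
import faiss
import numpy as np
dimension = 384
vectors = np.random.random((100000, dimension)).astype("float32")
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, 100)  # 100 clusters
index.train(vectors)  # step 1: k-means clustering of the dataset
index.add(vectors)
index.nprobe = 10  # step 2: probe only the 10 nearest clusters per query
query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, 5)  # step 3: search inside those clusters only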
3. Product Quantization (PQ)
PQ compresses vectors to reduce memory.
Original: [768 dimensions × 4 bytes] = 3072 bytes
After PQ: [96 codes × 1 byte] = 96 bytes (32x smaller!)
How it works: split each vector into sub-vectors (e.g., 768 dimensions → 96 sub-vectors of 8 dimensions each), learn a small codebook of centroids for each sub-space with k-means, then store only the 1-byte ID of the nearest centroid for each sub-vector - the 96 codes shown above.
Trade-off: less memory and faster search, at the cost of some accuracy
Used by: FAISS (PQ, IVF-PQ)
import faiss
import numpy as np
# Sample data
dimension = 384
num_vectors = 100000
# Generate random vectors (in practice, use real embeddings)
vectors = np.random.random((num_vectors, dimension)).astype('float32')
# Build HNSW index
index = faiss.IndexHNSWFlat(dimension, 32) # 32 = M (neighbors per node)
index.add(vectors)
# Search
query = np.random.random((1, dimension)).astype('float32')
k = 5
distances, indices = index.search(query, k)
print(f"Top {k} nearest neighbors:")
print(f"Indices: {indices[0]}")
print(f"Distances: {distances[0]}")
FAISS index types:
# Exact search (brute force)
index = faiss.IndexFlatL2(dimension)
# HNSW
index = faiss.IndexHNSWFlat(dimension, 32)
# IVF
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, 100) # 100 clusters
# IVF + PQ (memory efficient)
index = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)
Vector databases are systems purpose-built for storing and searching vectors.
Key components:
┌─────────────────────────────────────┐
│ Vector Database │
├─────────────────────────────────────┤
│ 1. Storage Layer │
│ - Vectors │
│ - Metadata (text, tags, etc.) │
│ │
│ 2. Index Layer (HNSW/IVF) │
│ - Optimize search │
│ │
│ 3. Query Engine │
│ - Vector search │
│ - Hybrid search (vector + filters)│
│ │
│ 4. API Layer │
│ - CRUD operations │
│ - Search endpoints │
└─────────────────────────────────────┘
1. Pinecone (Managed, Cloud-only)
import pinecone
# Initialize
pinecone.init(api_key="your-key", environment="us-west1-gcp")
# Create index
pinecone.create_index(
name="my-index",
dimension=384,
metric="cosine"
)
# Connect
index = pinecone.Index("my-index")
# Upsert vectors
vectors = [
("id1", [0.1, 0.2, ...], {"text": "Document 1"}),
("id2", [0.3, 0.4, ...], {"text": "Document 2"})
]
index.upsert(vectors)
# Search
results = index.query(
vector=[0.1, 0.2, ...],
top_k=5,
include_metadata=True
)
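Note that this snippet uses the older pinecone-client interface; recent versions of the Pinecone SDK replace pinecone.init with a Pinecone class. A sketch under that assumption (the cloud/region values are illustrative):
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-key")
pc.create_index(
    name="my-index",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")  # illustrative serverless spec
)
index = pc.Index("my-index")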
2. Weaviate (Open source, self-hosted)
import weaviate
client = weaviate.Client("http://localhost:8080")
# Create schema
schema = {
"class": "Document",
"vectorizer": "text2vec-openai",
"properties": [
{"name": "text", "dataType": ["text"]}
]
}
client.schema.create_class(schema)
# Add data (auto-vectorizes)
client.data_object.create(
class_name="Document",
data_object={"text": "Machine learning tutorial"}
)
# Search
result = client.query.get("Document", ["text"]) \
.with_near_text({"concepts": ["AI training"]}) \
.with_limit(5) \
.do()
3. Qdrant (Open source, Rust-based)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient("localhost", port=6333)
# Create collection
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
# Insert
client.upsert(
collection_name="documents",
points=[
{
"id": 1,
"vector": [0.1, 0.2, ...],
"payload": {"text": "Document 1"}
}
]
)
# Search
results = client.search(
collection_name="documents",
query_vector=[0.1, 0.2, ...],
limit=5
)
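Qdrant also supports the "vector + filters" hybrid queries mentioned in the architecture diagram earlier. A sketch, assuming a hypothetical category field has been added to the payload:
from qdrant_client.models import Filter, FieldCondition, MatchValue
results = client.search(
    collection_name="documents",
    query_vector=[0.0] * 384,  # placeholder query vector
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tutorial"))]  # hypothetical payload field
    ),
    limit=5
)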
4. ChromaDB (Embedded, dev-friendly)
import chromadb
client = chromadb.Client()
# Create collection
collection = client.create_collection("documents")
# Add documents (auto-embeds)
collection.add(
documents=["Document 1", "Document 2"],
ids=["id1", "id2"]
)
# Query
results = collection.query(
query_texts=["machine learning"],
n_results=5
)
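The in-memory client above loses its data when the process exits; recent ChromaDB versions also provide a persistent client, sketched here (the storage path is illustrative):
# Store collections on disk instead of in memory
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")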
| Database | Best For | Pros | Cons |
|---|---|---|---|
| Pinecone | Production, don't want to manage infra | Fully managed, easy scaling | Cost, vendor lock-in |
| Weaviate | Flexible schema, hybrid search | Rich features, good docs | Setup complexity |
| Qdrant | Performance-critical, self-hosted | Fast (Rust), filtering | Newer ecosystem |
| ChromaDB | Prototyping, embedded apps | Super easy, no server needed | Not for large scale |
| FAISS | DIY, custom solutions | Maximum control, free | Library only: no server, metadata storage, or CRUD |
In the next post, we will explore Advanced Retrieval Strategies - chunking, hybrid search, re-ranking, and query optimization techniques for RAG systems.
This post is part of the "From Zero to AI Engineer" series - Module 7: RAG (Retrieval-Augmented Generation)