When you start building RAG applications, managing prompt templates, retrieval logic, memory, and error handling can become complex very quickly. LLM orchestration frameworks exist to abstract this complexity away, letting developers focus on business logic instead of low-level implementation.
In this post, we will explore the two most popular frameworks - LangChain and LlamaIndex - along with patterns for building production-ready LLM applications.
Before diving into the frameworks, let's look at the high-level architecture:
┌─────────────────────────────────────────┐
│             LLM Application             │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────┐        ┌──────────────┐   │
│  │  User    │───────▶│   Prompt     │   │
│  │  Query   │        │ Engineering  │   │
│  └──────────┘        └──────────────┘   │
│                             │           │
│                             ▼           │
│  ┌──────────┐        ┌──────────────┐   │
│  │  Vector  │◀───────│  Retrieval   │   │
│  │  DB      │        │   Logic      │   │
│  └──────────┘        └──────────────┘   │
│                             │           │
│                             ▼           │
│  ┌──────────┐        ┌──────────────┐   │
│  │  Memory  │◀──────▶│     LLM      │   │
│  │  Store   │        │   Calling    │   │
│  └──────────┘        └──────────────┘   │
│                             │           │
│                             ▼           │
│                      ┌──────────────┐   │
│                      │   Response   │   │
│                      │  Processing  │   │
│                      └──────────────┘   │
│                                         │
└─────────────────────────────────────────┘
A chain is a sequence of operations in which the output of one step becomes the input of the next.
Example workflow:
User Query
↓
Retrieval (get relevant docs)
↓
Prompt Construction (format docs + query)
↓
LLM Call
↓
Response
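In the classic LangChain API this kind of sequencing can be expressed directly. A minimal sketch using LLMChain and SimpleSequentialChain (the two prompt texts below are just placeholders; a full RAG chain with a retrieval step is shown later in this post):
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

llm = OpenAI(temperature=0)

# Step 1: rewrite the user query as a standalone question
rewrite_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template("Rewrite as a standalone question: {query}")
)

# Step 2: answer the rewritten question
answer_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template("Answer concisely: {question}")
)

# Output of step 1 is fed as input to step 2
overall_chain = SimpleSequentialChain(chains=[rewrite_chain, answer_chain])
result = overall_chain.run("python vs java which better")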
A router decides the execution flow based on the input.
User Query
↓
Router
├─→ SQL Chain (if database query)
├─→ RAG Chain (if knowledge retrieval)
└─→ Math Chain (if calculation)
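A router can be as simple as an LLM classification step followed by a dispatch. A hand-rolled sketch (sql_chain, rag_chain, and math_chain are assumed to be chains you have already built; LangChain also ships dedicated router chains):
def route(query: str) -> str:
    """Classify the query with the LLM, then run the matching chain."""
    decision = llm(
        "Classify this query as one of: sql, rag, math.\n"
        f"Query: {query}\n"
        "Answer with a single word:"
    ).strip().lower()

    if "sql" in decision:
        return sql_chain.run(query)    # database query
    if "math" in decision:
        return math_chain.run(query)   # calculation
    return rag_chain.run(query)        # default: knowledge retrieval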
Memory stores the conversation history so that context is maintained across turns.
Turn 1: "What is Python?"
→ LLM: "Python is a programming language..."
Turn 2: "What are its benefits?"
→ Memory: Include Turn 1 context
→ LLM understands "its" refers to Python
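Under the hood the framework simply injects the stored history into the next prompt. A quick sketch with LangChain's ConversationBufferMemory (covered in more detail below):
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context(
    {"input": "What is Python?"},
    {"output": "Python is a programming language..."}
)

# This history string is prepended to the next prompt,
# which is how the model resolves "its" in the follow-up question.
print(memory.load_memory_variables({}))
# {'history': 'Human: What is Python?\nAI: Python is a programming language...'}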
LangChain is a general-purpose framework for building LLM applications.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
# 1. Setup LLM
llm = OpenAI(temperature=0.7)
# 2. Create prompt template
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?"
)
# 3. Create chain
chain = LLMChain(llm=llm, prompt=prompt)
# 4. Run
result = chain.run("eco-friendly water bottles")
print(result) # "AquaPure Eco Bottles"
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# 1. Load documents
loader = TextLoader("docs.txt")
documents = loader.load()
# 2. Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",  # "stuff" = put all docs in prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
# 5. Query
query = "What is machine learning?"
answer = qa_chain.run(query)
print(answer)
1. Stuff Chain
# Put ALL retrieved docs into single prompt
chain_type="stuff"
# Pros: Simple, best quality (LLM sees all context)
# Cons: Limited by context window
# Use when: < 4-5 documents
2. Map-Reduce Chain
chain_type="map_reduce"
# Process:
# 1. Map: Summarize each doc individually
# 2. Reduce: Combine summaries into final answer
# Pros: Handle many documents
# Cons: Multiple LLM calls (slower, expensive)
# Use when: Many documents, need comprehensive answer
3. Refine Chain
chain_type="refine"
# Process:
# 1. Answer based on doc 1
# 2. Refine answer with doc 2
# 3. Refine again with doc 3...
# Pros: Iterative improvement
# Cons: Sequential (slow), early docs have more influence
# Use when: Documents have a natural order
4. Map-Rerank Chain
chain_type="map_rerank"
# Process:
# 1. Score each doc independently
# 2. Return answer from highest-scored doc
# Pros: Fast, good for QA
# Cons: Only uses single "best" doc
# Use when: Answer likely in one doc
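All four strategies plug into the same RetrievalQA factory through the chain_type argument, so switching is a one-line change. A sketch reusing the vectorstore built earlier:
# Same pipeline as before, but with map_reduce:
# each retrieved doc is summarized separately, then the summaries
# are combined into a single final answer.
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="map_reduce",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)
answer = qa_map_reduce.run("Summarize what the documents say about training data")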
ConversationBufferMemory - Store all messages:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=OpenAI(),
    memory=memory
)
conversation.predict(input="Hi, I'm John")
# "Hello John! How can I help you?"
conversation.predict(input="What's my name?")
# "Your name is John!"
ConversationBufferWindowMemory - Keep last K messages:
memory = ConversationBufferWindowMemory(k=2)
# Only remembers last 2 interactions
# Pros: Limit token usage
# Cons: Forget older context
ConversationSummaryMemory - Summarize old messages:
from langchain.memory import ConversationSummaryMemory
memory = ConversationSummaryMemory(llm=OpenAI())
# Automatically summarizes old messages
# Pros: Compress long conversations
# Cons: May lose details
VectorStoreMemory - Semantic search in history:
from langchain.memory import VectorStoreRetrieverMemory
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2})
)
# Retrieves relevant past messages, not just recent
# Good for: Long conversations, topic switching
Agents can decide which tools to use based on the query.
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.llms import OpenAI
llm = OpenAI(temperature=0)
# Load tools
tools = load_tools(
    ["serpapi", "llm-math"],  # Google search + calculator
    llm=llm
)
# Initialize agent
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
# Query
agent.run("What is the square root of the population of France?")
# Agent reasoning:
# 1. "I need to find population of France" → Use Google search
# 2. "Now I need to calculate square root" → Use calculator
# 3. Return final answer
Agent types:
- ZERO_SHOT_REACT_DESCRIPTION: Choose tool based on description
- CONVERSATIONAL_REACT_DESCRIPTION: With memory
- STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION: For complex tools
You can also define custom tools:
from langchain.tools import Tool
def search_database(query):
    # Your custom logic
    return f"Database results for: {query}"

database_tool = Tool(
    name="Database Search",
    func=search_database,
    description="Useful for searching company database. Input: search query"
)
tools = [database_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
LlamaIndex (formerly GPT Index) is optimized specifically for data indexing and retrieval.
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
# 1. Load documents
documents = SimpleDirectoryReader("data/").load_data()
# 2. Configure LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)
# 3. Create index
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)
# 4. Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
print(response)
1. List Index - Sequential scan:
from llama_index import ListIndex
index = ListIndex.from_documents(documents)
# Use when: Small datasets, need to check all docs
2. Tree Index - Hierarchical summarization:
from llama_index import TreeIndex
index = TreeIndex.from_documents(documents)
# Builds tree: leaf nodes = chunks, parent nodes = summaries
# Good for: Summarization tasks
3. Keyword Table Index - Keyword extraction:
from llama_index import KeywordTableIndex
index = KeywordTableIndex.from_documents(documents)
# Extracts keywords from each chunk
# Good for: Keyword-based retrieval
4. Knowledge Graph Index:
from llama_index import KnowledgeGraphIndex
index = KnowledgeGraphIndex.from_documents(documents)
# Extracts entities and relationships
# Good for: Complex relationship queries
LlamaIndex offers flexible response modes:
query_engine = index.as_query_engine(
    response_mode="compact"  # Options: compact, tree_summarize, refine, etc.
)
# compact: Concatenate chunks, call LLM once
# tree_summarize: Build tree, summarize bottom-up
# refine: Iteratively refine answer with each chunk
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine import TransformQueryEngine

# HyDE: generate a hypothetical answer document and retrieve with it
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(
    index.as_query_engine(),
    query_transform=hyde
)
# HyDE is now applied automatically before retrieval
from llama_index import ComposableGraph
# Multiple indexes for different data sources
vector_index1 = VectorStoreIndex.from_documents(docs1)
vector_index2 = VectorStoreIndex.from_documents(docs2)
# Compose into graph
graph = ComposableGraph.from_indices(
    ListIndex,
    [vector_index1, vector_index2],
    index_summaries=["Summary of docs1", "Summary of docs2"]
)
# Query decomposition across indexes
query_engine = graph.as_query_engine()
response = query_engine.query("Compare information from both sources")
| Feature | LangChain | LlamaIndex |
|---|---|---|
| Primary Focus | General LLM apps | RAG/Data indexing |
| Flexibility | Very flexible, many options | Opinionated, best practices |
| Learning Curve | Steeper | Gentler |
| RAG Performance | Good | Excellent |
| Agent Support | Excellent | Limited |
| Memory Options | Many types | Basic |
| Data Connectors | ~100 | ~160 (LlamaHub) |
| Use When | Complex workflows, agents | RAG-focused apps |
Complementary use:
# Use LlamaIndex for indexing
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Convert to LangChain retriever
from llama_index.langchain_helpers.retrievers import LlamaIndexRetriever
retriever = LlamaIndexRetriever(index)
# Use in LangChain chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)
from langchain.callbacks import get_openai_callback
import logging
def safe_query(chain, query, max_retries=3):
    for attempt in range(max_retries):
        try:
            with get_openai_callback() as cb:
                response = chain.run(query)
            # Log token usage
            logging.info(f"Tokens used: {cb.total_tokens}")
            return response
        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed: {e}")
    # All retries exhausted - return a fallback message instead of crashing
    return "Unable to process query. Please try again."
from langchain.cache import InMemoryCache, SQLiteCache
import langchain
# In-memory cache
langchain.llm_cache = InMemoryCache()
# Persistent cache
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")
# Same query won't call LLM twice
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = OpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)
# Prints response token by token
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    result = chain.run("query")
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
from llama_index.evaluation import ResponseEvaluator
evaluator = ResponseEvaluator()
# Evaluate response quality
response = query_engine.query("What is ML?")
eval_result = evaluator.evaluate(
    query="What is ML?",
    response=response
)
print(f"Faithfulness: {eval_result.passing}") # True/False
✅ Start simple, iterate
# Start with basic RAG
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
# Then add: chunking strategy, re-ranking, hybrid search, etc.
✅ Monitor token usage
# Always wrap expensive calls
with get_openai_callback() as cb:
    response = chain.run(query)
    if cb.total_cost > 0.10:  # $0.10 threshold
        alert_high_cost()
✅ Use appropriate memory type
# Short conversations: ConversationBufferMemory
# Long conversations: ConversationSummaryMemory
# Topic-based: VectorStoreMemory
✅ Implement fallbacks
from langchain.llms import OpenAI, Anthropic
llm = OpenAI()
fallback_llm = Anthropic()
try:
    response = llm(prompt)
except Exception:
    response = fallback_llm(prompt)
✅ Version control prompts
# Store prompts in files, not hardcoded
with open("prompts/qa_prompt.txt") as f:
    prompt_template = f.read()
# Easy to iterate without code changes
In the next post, we will explore Fine-tuning & Adaptation - LoRA, QLoRA, and how to customize LLMs for specific domains.
This article is part of the series "From Zero to AI Engineer" - Module 7: RAG (Retrieval-Augmented Generation)