What You'll Build
By the end of this tutorial, you'll have a working Retrieval-Augmented Generation (RAG) system that uses sentence window retrieval instead of naive chunking. Instead of splitting documents into arbitrary 512-token chunks, you'll retrieve small, precise sentences for matching, then expand the context window to include surrounding sentences before feeding to your LLM.
The final result is a Python application that ingests documents, builds a searchable index with sentence-level granularity, and answers questions with better context preservation than standard RAG implementations. You'll compare outputs side-by-side to see how sentence window retrieval preserves context that gets lost at chunk boundaries, particularly when explanations span multiple sentences.
Prerequisites
- Python 3.10+ (tested on 3.10.12 and 3.11.7)
- OpenAI API key with access to embeddings and GPT-3.5-turbo (the tutorial default) or GPT-4
- 8GB+ RAM - the vector store can be memory-intensive with large documents
- Estimated cost: ~$0.01-0.02 for the tutorial (using GPT-3.5-turbo)
- Basic familiarity with embeddings and vector similarity search
- Estimated time: 45-60 minutes including testing
Install required packages:
pip install llama-index==0.9.48 openai==1.12.0 nltk==3.8.1
# For hybrid search feature (optional):
pip install llama-index-retrievers-bm25
Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY='sk-your-actual-key-here'
On Windows:
set OPENAI_API_KEY=sk-your-actual-key-here
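If you want to fail fast with a readable message instead of a cryptic OpenAI auth error later, a small startup check helps. check_api_key is a hypothetical helper (not part of llama-index or openai); a minimal sketch:

```python
import os

def check_api_key(env=None):
    """Return the OpenAI key from the given mapping (defaults to os.environ),
    raising a clear error if it's missing or malformed."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    if not key.startswith("sk-"):
        raise RuntimeError(
            "OPENAI_API_KEY is missing or malformed. "
            "Set it with: export OPENAI_API_KEY='sk-...'"
        )
    return key
```

Call it once at the top of your script, before building the index.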
Step-by-Step Instructions
Step 1: Set Up the Project Structure
Create a new directory and the basic files:
mkdir rag-sentence-window
cd rag-sentence-window
touch sentence_window_rag.py
mkdir data
What this does: Creates a clean workspace with a Python file for the code and a data/ folder for test documents.
Step 2: Create a Test Document
Create a sample document that demonstrates why sentence window retrieval matters. Save this as data/sample.txt:
The transformer architecture was introduced in 2017. It revolutionized natural language processing. The key innovation was the self-attention mechanism. This mechanism allows the model to weigh the importance of different words in a sequence. Traditional RNNs processed sequences sequentially. This created bottlenecks for parallelization. Transformers process all tokens simultaneously. This enables much faster training on modern GPUs.
The attention mechanism computes three vectors for each token: query, key, and value. These vectors are learned during training. The dot product of query and key vectors determines attention weights. Higher weights mean stronger connections between tokens. The weighted sum of value vectors produces the output. This process happens in multiple attention heads simultaneously.
BERT was released by Google in 2018. It used bidirectional training of transformers. Previous autoregressive models like GPT-1/GPT-2 only looked at left context. BERT looks at both left and right context. This improved performance on many NLP tasks. BERT achieved state-of-the-art results on GLUE benchmark.
Why this document? Context flows between sentences. Naive chunking at 100 tokens might split "BERT looks at both left and right context. This improved performance..." across chunks, losing the causal relationship. Sentence window retrieval preserves these connections.
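You can see the boundary problem without any libraries. The sketch below uses a crude character-based chunker and a toy sentence splitter (stand-ins for real tokenizers, not LlamaIndex code) to show how a fixed-size boundary slices the causal sentence mid-word, so neither chunk carries the complete claim, while sentence-level units keep each claim intact:

```python
text = (
    "BERT looks at both left and right context. "
    "This improved performance on many NLP tasks."
)

def naive_chunks(text, size):
    """Split into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentences(text):
    """Crude split on '. ' -- a stand-in for NLTK's punkt tokenizer."""
    parts = [s.strip() for s in text.split(". ") if s.strip()]
    return [s if s.endswith(".") else s + "." for s in parts]

chunks = naive_chunks(text, 50)
# The verb "improved" is split across chunks ("This im" / "proved ..."),
# so the second chunk loses its subject and the causal link:
print(chunks)
print(sentences(text))
```

The sentence splitter, by contrast, returns two complete statements that a retriever can match and expand independently.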
Step 3: Build the Sentence Window Retriever
Open sentence_window_rag.py and add this code:
import os
import nltk
from llama_index import (
Document,
ServiceContext,
VectorStoreIndex,
)
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
# Download required NLTK data for sentence tokenization
nltk.download('punkt', quiet=True)
def load_documents(data_dir="data"):
"""Load all .txt files from data directory."""
documents = []
# Check if directory exists
if not os.path.exists(data_dir):
raise FileNotFoundError(f"Data directory '{data_dir}' not found. Create it and add .txt files.")
for filename in os.listdir(data_dir):
if filename.endswith('.txt'):
filepath = os.path.join(data_dir, filename)
try:
with open(filepath, 'r', encoding='utf-8') as f:
text = f.read()
if not text.strip():
print(f"Warning: {filename} is empty, skipping.")
continue
# Create Document object with metadata
doc = Document(
text=text,
metadata={"filename": filename}
)
documents.append(doc)
except Exception as e:
print(f"Error reading {filename}: {e}")
continue
if not documents:
raise ValueError(f"No valid .txt files found in '{data_dir}'")
print(f"Loaded {len(documents)} documents")
return documents
def build_sentence_window_index(documents, window_size=3):
"""
Build index using sentence window retrieval.
Args:
documents: List of Document objects
window_size: Number of sentences before/after to include as context
Returns:
tuple: (index, service_context)
"""
# Initialize LLM and embedding model
# Using GPT-3.5-turbo for cost-effectiveness in tutorials (~$0.002/query vs $0.03 for GPT-4)
# Upgrade to GPT-4 for production if higher reasoning quality is needed
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Create sentence window node parser
# This splits documents into sentences but keeps metadata about surrounding sentences
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=window_size, # sentences before and after
window_metadata_key="window",
original_text_metadata_key="original_text",
)
# Build service context with our models
service_context = ServiceContext.from_defaults(
llm=llm,
embed_model=embed_model,
node_parser=node_parser,
)
# Create the index
print("Building sentence window index...")
index = VectorStoreIndex.from_documents(
documents,
service_context=service_context,
)
print(f"Index built with {len(index.docstore.docs)} nodes")
return index, service_context
def create_query_engine(index, service_context):
"""Create query engine with metadata replacement post-processor."""
# This post-processor replaces the retrieved sentence with its full window
postprocessor = MetadataReplacementPostProcessor(
target_metadata_key="window"
)
query_engine = index.as_query_engine(
service_context=service_context,
similarity_top_k=3, # retrieve top 3 sentence matches (adjust based on needs)
node_postprocessors=[postprocessor],
)
return query_engine
if __name__ == "__main__":
# Load documents
documents = load_documents()
# Build index with 3-sentence windows
index, service_context = build_sentence_window_index(
documents,
window_size=3
)
# Create query engine
query_engine = create_query_engine(index, service_context)
# Test query
query = "How does the attention mechanism work in transformers?"
print(f"\nQuery: {query}\n")
response = query_engine.query(query)
print(f"Response:\n{response}\n")
# Show source nodes to see the window expansion
print("Source nodes:")
for i, node in enumerate(response.source_nodes, 1):
print(f"\nNode {i} (score: {node.score:.4f}):")
print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
print(f"Window context: {node.node.text[:200]}...")
What's happening here:
- SentenceWindowNodeParser: Splits documents into individual sentences for embedding (precise retrieval) but stores surrounding sentences in metadata
- MetadataReplacementPostProcessor: When retrieving a matching sentence, swaps in the full window before sending to the LLM
- window_size=3: Includes 3 sentences before and 3 sentences after each retrieved sentence
This approach gives you both precision (matching exact sentences) and context (the surrounding explanation), which fixed-size chunking can't guarantee.
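If it helps to see the mechanics, the parser's core idea fits in a few lines of plain Python. This is an illustrative sketch, not LlamaIndex internals; the real SentenceWindowNodeParser also handles sentence tokenization and full node metadata:

```python
def build_sentence_windows(sentences, window_size=3):
    """For each sentence, attach the surrounding window as metadata,
    mirroring the two keys SentenceWindowNodeParser stores per node."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sent,                  # what gets embedded/matched
            "window": " ".join(sentences[lo:hi]),   # what the LLM ultimately sees
        })
    return nodes

sents = [f"Sentence {i}." for i in range(10)]
nodes = build_sentence_windows(sents, window_size=3)
```

Interior sentences get a 7-sentence window (3 before + self + 3 after); sentences near the document edges get a clipped window, which is why node windows vary in length.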
Step 4: Run the Initial Test
Execute the script:
python sentence_window_rag.py
Expected output structure:
Loaded 1 documents
Building sentence window index...
Index built with 13 nodes
Query: How does the attention mechanism work in transformers?
Response:
The attention mechanism in transformers works by computing three vectors for each token: query,
key, and value. These vectors are learned during training. The mechanism calculates attention
weights by taking the dot product of query and key vectors, where higher weights indicate
stronger connections between tokens. Finally, it produces the output through a weighted sum
of the value vectors, with this process occurring simultaneously across multiple attention heads.
Source nodes:
Node 1 (score: 0.8734): # scores vary per run — these are example values
Original sentence: The attention mechanism computes three vectors for each token: query, key, and value...
Window context: The key innovation was the self-attention mechanism. This mechanism allows the model to
weigh the importance of different words in a sequence. Traditional RNNs processed sequences sequentially...
Node 2 (score: 0.8521): # scores vary per run
Original sentence: Higher weights mean stronger connections between tokens...
Window context: The dot product of query and key vectors determines attention weights. Higher weights
mean stronger connections between tokens. The weighted sum of value vectors produces the output...
What to notice: The retrieval matched specific sentences about the attention mechanism, but the LLM received the full surrounding context (3 sentences before and after). This preserved the explanation flow and enabled a coherent response.
Step 5: Compare with Naive Chunking
Demonstrate the difference by implementing a comparison. Add this function to sentence_window_rag.py:
def build_naive_chunking_index(documents, chunk_size=200):
"""Build index using naive fixed-size chunking for comparison."""
from llama_index.node_parser import SentenceSplitter
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Simple fixed-size chunking with overlap
node_parser = SentenceSplitter.from_defaults(
chunk_size=chunk_size,
chunk_overlap=20,
)
service_context = ServiceContext.from_defaults(
llm=llm,
embed_model=embed_model,
node_parser=node_parser,
)
print("Building naive chunking index...")
index = VectorStoreIndex.from_documents(
documents,
service_context=service_context,
)
return index, service_context
Update your if __name__ == "__main__": block to add the comparison:
if __name__ == "__main__":
# Load documents
documents = load_documents()
# Build sentence window index
index, service_context = build_sentence_window_index(
documents,
window_size=3
)
# Create query engine
query_engine = create_query_engine(index, service_context)
# Test query
query = "How does the attention mechanism work in transformers?"
print(f"\nQuery: {query}\n")
response = query_engine.query(query)
print(f"Sentence Window Response:\n{response}\n")
# Show source nodes
print("Source nodes:")
for i, node in enumerate(response.source_nodes, 1):
print(f"\nNode {i} (score: {node.score:.4f}):")
print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
print(f"Window context: {node.node.text[:200]}...")
# Compare with naive chunking
print("\n" + "="*80)
print("COMPARISON: Naive Chunking Approach")
print("="*80)
naive_index, naive_service_context = build_naive_chunking_index(documents)
naive_query_engine = naive_index.as_query_engine(
service_context=naive_service_context,
similarity_top_k=3,
)
naive_response = naive_query_engine.query(query)
print(f"\nNaive Chunking Response:\n{naive_response}\n")
Run the updated script:
python sentence_window_rag.py
What to observe: The naive chunking approach may split the attention mechanism explanation awkwardly depending on where chunk boundaries fall. Response quality varies unpredictably based on how the text gets divided—that's exactly the problem sentence window retrieval solves.
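The intuition behind the comparison doesn't need API calls. Below, a Jaccard word-overlap score (a crude, assumed stand-in for embedding cosine similarity) shows why a small sentence unit matches a query more sharply than the same sentence diluted inside a larger chunk; sentence window retrieval keeps that sharp match, then restores context via the window:

```python
def overlap_score(query, passage):
    """Jaccard word overlap: a toy stand-in for embedding similarity."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)

query = "dot product of query and key vectors"
sentence = "The dot product of query and key vectors determines attention weights."
# The same sentence buried in a larger chunk with unrelated material:
big_chunk = (sentence + " BERT was released by Google in 2018."
             " It used bidirectional training.")

# The bare sentence scores higher than the diluted chunk:
print(overlap_score(query, sentence), overlap_score(query, big_chunk))
```

Real embeddings are far more robust than word overlap, but the dilution effect is the same: the bigger the chunk, the more off-topic tokens drag down its similarity to a focused query.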
Step 6: Add Interactive Query Mode
Make the system interactive for easier testing. Add this function:
def interactive_mode(query_engine):
"""Run interactive query loop."""
print("\n" + "="*80)
print("Interactive Mode - Type 'quit' to exit")
print("="*80 + "\n")
while True:
query = input("Your question: ").strip()
if query.lower() in ['quit', 'exit', 'q']:
break
if not query:
continue
response = query_engine.query(query)
print(f"\nAnswer: {response}\n")
# Optionally show sources
show_sources = input("Show sources? (y/n): ").strip().lower()
if show_sources == 'y':
for i, node in enumerate(response.source_nodes, 1):
print(f"\nSource {i} (relevance: {node.score:.4f}):")
print(node.node.text[:300] + "...")
print()
Update your main block to use interactive mode:
if __name__ == "__main__":
documents = load_documents()
index, service_context = build_sentence_window_index(documents, window_size=3)
query_engine = create_query_engine(index, service_context)
# Run interactive mode
interactive_mode(query_engine)
What this enables: A REPL-style interface for testing different queries and exploring how various questions retrieve different context windows. Try queries like:
- "What year was the transformer architecture introduced?"
- "How does BERT differ from GPT?"
- "What are the three vectors in the attention mechanism?"
Verification
Confirm everything works correctly with these checks:
- Verify index creation: You should see output like "Index built with 13 nodes" where the number roughly matches the sentence count in your document.
- Test a specific query: Ask "What year was the transformer architecture introduced?" - the response should include "2017" with surrounding context about the innovation.
- Inspect source nodes: When you show sources, the "Window context" should be noticeably longer (typically 5-7 sentences) than the "Original sentence" (1 sentence).
- Check API calls: You should see network activity indicating OpenAI API calls for both embeddings (during indexing) and completions (during queries).
- Test window expansion: Run this verification query in interactive mode:
query = "What are query, key, and value vectors?"
The response should include context from surrounding sentences about how these vectors are used, not just their definition.
Success indicators: Coherent, contextual answers that naturally reference information from multiple sentences, with source nodes showing expanded context windows.
Common Issues & Fixes
Issue 1: NLTK punkt tokenizer error
Note: If you see Resource punkt_tab not found instead, your NLTK version is newer than the pinned release; download the extra resource with nltk.download('punkt_tab'). On the pinned version, only punkt is needed for sentence tokenization.
Error message:
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
**********************************************************************
Fix: Manually download NLTK data:
python -c "import nltk; nltk.download('punkt')"
If that fails due to SSL issues (common on corporate networks), specify a download directory:
python -c "import nltk; nltk.download('punkt', download_dir='~/nltk_data')"
Then set the NLTK data path in your script before importing:
import nltk
nltk.data.path.append('/path/to/nltk_data')
Issue 2: Index built with 0 nodes
Symptoms: Output shows "Index built with 0 nodes" and queries return no results.
Fix: Verify your documents loaded correctly:
ls -la data/
# Should show sample.txt with non-zero size
Check file encoding:
file data/sample.txt
# Should show: ASCII text or UTF-8 Unicode text
If the file exists but isn't loading, add debug output:
def load_documents(data_dir="data"):
documents = []
print(f"Looking for files in: {os.path.abspath(data_dir)}")
for filename in os.listdir(data_dir):
print(f"Found file: {filename}")
if filename.endswith('.txt'):
# ... rest of function
Issue 3: OpenAI rate limit errors
Error message:
openai.RateLimitError: You exceeded your current quota, please check your plan and billing details.
Fix option 1: This particular message usually indicates a billing/quota problem rather than request throttling, so check your usage dashboard first. The tutorial already uses the cheaper GPT-3.5-turbo by default. To upgrade to GPT-4 once quota allows:
llm = OpenAI(model="gpt-4", temperature=0.1) # Higher quality, ~20x cost
Cost Reference:
- GPT-3.5-turbo: ~$0.0015/1K tokens (tutorial default)
- GPT-4: ~$0.03/1K tokens (~20x more expensive)
- Embedding (ada-002): ~$0.0001/1K tokens
- Total tutorial cost: ~$0.01-0.02 for indexing the sample document
Fix option 2: Add retry logic with exponential backoff:
from tenacity import retry, wait_exponential, stop_after_attempt
@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(3))
def query_with_retry(query_engine, query):
return query_engine.query(query)
# Use it like:
response = query_with_retry(query_engine, "Your question here")
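If you'd rather not add the tenacity dependency, a hand-rolled backoff loop does the same job. This is a sketch with assumed parameters mirroring the tenacity settings above; in real use, catch openai.RateLimitError specifically rather than bare Exception:

```python
import time

def query_with_backoff(fn, *args, retries=3, base_delay=4.0,
                       max_delay=60.0, sleep=time.sleep):
    """Retry fn(*args) on exceptions, doubling the wait between attempts."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            sleep(min(delay, max_delay))
            delay *= 2

# Usage: response = query_with_backoff(query_engine.query, "Your question here")
```

The sleep function is injectable so you can unit-test the retry logic without real waits.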
Issue 4: ImportError for llama_index modules
Note: If the documentation or examples you're reading show .core import paths (e.g., llama_index.core.node_parser), they target llama-index v0.10+, which reorganized the package layout. This tutorial pins 0.9.48.
Always check the version:
pip show llama-index
Error message:
ImportError: cannot import name 'SentenceWindowNodeParser' from 'llama_index.core.node_parser'
Fix: Ensure you're using the correct llama-index version:
pip show llama-index
# Should show version 0.9.48
If the version is different, reinstall:
pip uninstall llama-index
pip install llama-index==0.9.48
Next Steps
You now have a working sentence window retrieval system. Here's how to extend it:
When NOT to Use Sentence Window Retrieval
Sentence window retrieval isn't always the best choice. Consider alternatives when:
- Documents are very short: If your docs are under 500 tokens, sentence-level splitting adds unnecessary complexity. Use simple chunking instead.
- Content is code-heavy: Code doesn't follow natural sentence boundaries. Punctuation in strings, comments, and syntax can confuse sentence tokenizers.
- Working with non-English text: NLTK's punkt tokenizer is English-optimized. Chinese, Japanese, Arabic, and other languages may get poor sentence boundaries.
- Content is structured data: Tables, lists, and JSON don't have "sentences" in the traditional sense. Consider parent document retrieval or auto-merging retrieval instead.
- Latency is critical: The extra post-processing step adds ~50-100ms per query. For ultra-low-latency requirements, use simpler retrieval.
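To act on the code-heavy caveat programmatically, you could pre-screen documents before picking a splitter. looks_code_heavy is a rough, hypothetical heuristic with an arbitrary threshold, not a library function:

```python
def looks_code_heavy(text, threshold=0.05):
    """Rough heuristic: a high density of code-ish punctuation suggests a
    sentence tokenizer will produce poor boundaries. Threshold is arbitrary."""
    if not text:
        return False
    code_chars = sum(text.count(c) for c in "{}();=<>[]")
    return code_chars / len(text) > threshold
```

Documents that trip the check are better served by fixed-size or syntax-aware chunking than by sentence windows.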
Immediate Improvements
- Tune the window size: Experiment with different values:
# More context (7 sentences total)
build_sentence_window_index(documents, window_size=3)
# Less context but more precision (3 sentences total)
build_sentence_window_index(documents, window_size=1)
# Maximum context (11 sentences total)
build_sentence_window_index(documents, window_size=5)
Window size 3 works well for technical documentation, but adjust based on your content.
- Add document metadata filtering: Extend metadata to filter at query time:
doc = Document(
    text=text,
    metadata={
        "filename": filename,
        "doc_type": "technical",
        "date": "2024-01-15",
    }
)
# Then filter during queries
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="doc_type", value="technical")]
)
query_engine = index.as_query_engine(filters=filters)
- Implement hybrid search: Combine vector similarity with BM25 keyword search (requires pip install llama-index-retrievers-bm25):
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.retrievers import QueryFusionRetriever

vector_retriever = index.as_retriever(similarity_top_k=3)
# BM25 uses the docstore for keyword-based retrieval
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=3,
)
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=3,
)
Production Enhancements
- Persist the index: Save to disk so you don't rebuild (and re-embed) on every run:
# Save index to disk
index.storage_context.persist(persist_dir="./storage")

# Load index from disk later
from llama_index import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
This saves embedding API costs and startup time on subsequent runs.