What You'll Build
By the end of this tutorial, you'll have a working Retrieval-Augmented Generation (RAG) system that uses sentence window retrieval instead of naive chunking. Instead of splitting documents into arbitrary 512-token chunks, you'll retrieve small, precise sentences for matching, then expand the context window to include surrounding sentences before feeding to your LLM.
The final result is a Python application that ingests documents, builds a searchable index with sentence-level granularity, and answers questions with better context preservation than standard RAG implementations. You'll compare outputs side-by-side to see how sentence window retrieval preserves context that gets lost at chunk boundaries, particularly when explanations span multiple sentences.
Prerequisites
- Python 3.10+ (tested on 3.10.12 and 3.11.7)
- OpenAI API key with access to embeddings and GPT-3.5-turbo (the tutorial default) or GPT-4
- 8GB+ RAM - the vector store can be memory-intensive with large documents
- Estimated cost: ~$0.01-0.02 for the tutorial (using GPT-3.5-turbo)
- Basic familiarity with embeddings and vector similarity search
- Estimated time: 45-60 minutes including testing
Install required packages:
pip install llama-index==0.9.48 openai==1.12.0 nltk==3.8.1
# For hybrid search feature (optional):
pip install llama-index-retrievers-bm25
Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY='sk-your-actual-key-here'
On Windows:
set OPENAI_API_KEY=sk-your-actual-key-here
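If you want to fail fast with a readable message instead of a cryptic OpenAI auth error later, a small startup check helps. check_api_key is a hypothetical helper (not part of llama-index or openai); a minimal sketch:

```python
import os

def check_api_key(env=None):
    """Return the OpenAI key from the given mapping (defaults to os.environ),
    raising a clear error if it's missing or malformed."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    if not key.startswith("sk-"):
        raise RuntimeError(
            "OPENAI_API_KEY is missing or malformed. "
            "Set it with: export OPENAI_API_KEY='sk-...'"
        )
    return key
```

Call it once at the top of your script, before building the index.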
Step-by-Step Instructions
Step 1: Set Up the Project Structure
Create a new directory and the basic files:
mkdir rag-sentence-window
cd rag-sentence-window
touch sentence_window_rag.py
mkdir data
What this does: Creates a clean workspace with a Python file for the code and a data/ folder for test documents.
Step 2: Create a Test Document
Create a sample document that demonstrates why sentence window retrieval matters. Save this as data/sample.txt:
The transformer architecture was introduced in 2017. It revolutionized natural language processing. The key innovation was the self-attention mechanism. This mechanism allows the model to weigh the importance of different words in a sequence. Traditional RNNs processed sequences sequentially. This created bottlenecks for parallelization. Transformers process all tokens simultaneously. This enables much faster training on modern GPUs.
The attention mechanism computes three vectors for each token: query, key, and value. These vectors are learned during training. The dot product of query and key vectors determines attention weights. Higher weights mean stronger connections between tokens. The weighted sum of value vectors produces the output. This process happens in multiple attention heads simultaneously.
BERT was released by Google in 2018. It used bidirectional training of transformers. Previous autoregressive models like GPT-1/GPT-2 only looked at left context. BERT looks at both left and right context. This improved performance on many NLP tasks. BERT achieved state-of-the-art results on GLUE benchmark.
Why this document? Context flows between sentences. Naive chunking at 100 tokens might split "BERT looks at both left and right context. This improved performance..." across chunks, losing the causal relationship. Sentence window retrieval preserves these connections.
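You can see the boundary problem without any libraries. The sketch below uses a crude character-based chunker and a toy sentence splitter (stand-ins for real tokenizers, not LlamaIndex code) to show how a fixed-size boundary slices the causal sentence mid-word, so neither chunk carries the complete claim, while sentence-level units keep each claim intact:

```python
text = (
    "BERT looks at both left and right context. "
    "This improved performance on many NLP tasks."
)

def naive_chunks(text, size):
    """Split into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentences(text):
    """Crude split on '. ' -- a stand-in for NLTK's punkt tokenizer."""
    parts = [s.strip() for s in text.split(". ") if s.strip()]
    return [s if s.endswith(".") else s + "." for s in parts]

chunks = naive_chunks(text, 50)
# The verb "improved" is split across chunks ("This im" / "proved ..."),
# so the second chunk loses its subject and the causal link:
print(chunks)
print(sentences(text))
```

The sentence splitter, by contrast, returns two complete statements that a retriever can match and expand independently.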
Step 3: Build the Sentence Window Retriever
Open sentence_window_rag.py and add this code:
import os
import nltk
from llama_index import (
Document,
ServiceContext,
VectorStoreIndex,
)
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
# Download required NLTK data for sentence tokenization
nltk.download('punkt', quiet=True)
def load_documents(data_dir="data"):
"""Load all .txt files from data directory."""
documents = []
# Check if directory exists
if not os.path.exists(data_dir):
raise FileNotFoundError(f"Data directory '{data_dir}' not found. Create it and add .txt files.")
for filename in os.listdir(data_dir):
if filename.endswith('.txt'):
filepath = os.path.join(data_dir, filename)
try:
with open(filepath, 'r', encoding='utf-8') as f:
text = f.read()
if not text.strip():
print(f"Warning: {filename} is empty, skipping.")
continue
# Create Document object with metadata
doc = Document(
text=text,
metadata={"filename": filename}
)
documents.append(doc)
except Exception as e:
print(f"Error reading {filename}: {e}")
continue
if not documents:
raise ValueError(f"No valid .txt files found in '{data_dir}'")
print(f"Loaded {len(documents)} documents")
return documents
def build_sentence_window_index(documents, window_size=3):
"""
Build index using sentence window retrieval.
Args:
documents: List of Document objects
window_size: Number of sentences before/after to include as context
Returns:
tuple: (index, service_context)
"""
# Initialize LLM and embedding model
# Using GPT-3.5-turbo for cost-effectiveness in tutorials (~$0.002/query vs $0.03 for GPT-4)
# Upgrade to GPT-4 for production if higher reasoning quality is needed
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Create sentence window node parser
# This splits documents into sentences but keeps metadata about surrounding sentences
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=window_size, # sentences before and after
window_metadata_key="window",
original_text_metadata_key="original_text",
)
# Build service context with our models
service_context = ServiceContext.from_defaults(
llm=llm,
embed_model=embed_model,
node_parser=node_parser,
)
# Create the index
print("Building sentence window index...")
index = VectorStoreIndex.from_documents(
documents,
service_context=service_context,
)
print(f"Index built with {len(index.docstore.docs)} nodes")
return index, service_context
def create_query_engine(index, service_context):
"""Create query engine with metadata replacement post-processor."""
# This post-processor replaces the retrieved sentence with its full window
postprocessor = MetadataReplacementPostProcessor(
target_metadata_key="window"
)
query_engine = index.as_query_engine(
service_context=service_context,
similarity_top_k=3, # retrieve top 3 sentence matches (adjust based on needs)
node_postprocessors=[postprocessor],
)
return query_engine
if __name__ == "__main__":
# Load documents
documents = load_documents()
# Build index with 3-sentence windows
index, service_context = build_sentence_window_index(
documents,
window_size=3
)
# Create query engine
query_engine = create_query_engine(index, service_context)
# Test query
query = "How does the attention mechanism work in transformers?"
print(f"\nQuery: {query}\n")
response = query_engine.query(query)
print(f"Response:\n{response}\n")
# Show source nodes to see the window expansion
print("Source nodes:")
for i, node in enumerate(response.source_nodes, 1):
print(f"\nNode {i} (score: {node.score:.4f}):")
print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
print(f"Window context: {node.node.text[:200]}...")
What's happening here:
- SentenceWindowNodeParser: Splits documents into individual sentences for embedding (precise retrieval) but stores surrounding sentences in metadata
- MetadataReplacementPostProcessor: When retrieving a matching sentence, swaps in the full window before sending to the LLM
- window_size=3: Includes 3 sentences before and 3 sentences after each retrieved sentence
This approach gives you both precision (matching exact sentences) and context (the surrounding explanation), which fixed-size chunking can't guarantee.
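If it helps to see the mechanics, the parser's core idea fits in a few lines of plain Python. This is an illustrative sketch, not LlamaIndex internals; the real SentenceWindowNodeParser also handles sentence tokenization and full node metadata:

```python
def build_sentence_windows(sentences, window_size=3):
    """For each sentence, attach the surrounding window as metadata,
    mirroring the two keys SentenceWindowNodeParser stores per node."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sent,                  # what gets embedded/matched
            "window": " ".join(sentences[lo:hi]),   # what the LLM ultimately sees
        })
    return nodes

sents = [f"Sentence {i}." for i in range(10)]
nodes = build_sentence_windows(sents, window_size=3)
```

Interior sentences get a 7-sentence window (3 before + self + 3 after); sentences near the document edges get a clipped window, which is why node windows vary in length.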
Step 4: Run the Initial Test
Execute the script:
python sentence_window_rag.py
Expected output structure:
Loaded 1 documents
Building sentence window index...
Index built with 13 nodes
Query: How does the attention mechanism work in transformers?
Response:
The attention mechanism in transformers works by computing three vectors for each token: query,
key, and value. These vectors are learned during training. The mechanism calculates attention
weights by taking the dot product of query and key vectors, where higher weights indicate
stronger connections between tokens. Finally, it produces the output through a weighted sum
of the value vectors, with this process occurring simultaneously across multiple attention heads.
Source nodes:
Node 1 (score: 0.8734): # scores vary per run — these are example values
Original sentence: The attention mechanism computes three vectors for each token: query, key, and value...
Window context: The key innovation was the self-attention mechanism. This mechanism allows the model to
weigh the importance of different words in a sequence. Traditional RNNs processed sequences sequentially...
Node 2 (score: 0.8521): # scores vary per run
Original sentence: Higher weights mean stronger connections between tokens...
Window context: The dot product of query and key vectors determines attention weights. Higher weights
mean stronger connections between tokens. The weighted sum of value vectors produces the output...
What to notice: The retrieval matched specific sentences about the attention mechanism, but the LLM received the full surrounding context (3 sentences before and after). This preserved the explanation flow and enabled a coherent response.
Step 5: Compare with Naive Chunking
Demonstrate the difference by implementing a comparison. Add this function to sentence_window_rag.py:
def build_naive_chunking_index(documents, chunk_size=200):
"""Build index using naive fixed-size chunking for comparison."""
from llama_index.node_parser import SentenceSplitter
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Simple fixed-size chunking with overlap
node_parser = SentenceSplitter.from_defaults(
chunk_size=chunk_size,
chunk_overlap=20,
)
service_context = ServiceContext.from_defaults(
llm=llm,
embed_model=embed_model,
node_parser=node_parser,
)
print("Building naive chunking index...")
index = VectorStoreIndex.from_documents(
documents,
service_context=service_context,
)
return index, service_context
Update your if __name__ == "__main__": block to add the comparison:
if __name__ == "__main__":
# Load documents
documents = load_documents()
# Build sentence window index
index, service_context = build_sentence_window_index(
documents,
window_size=3
)
# Create query engine
query_engine = create_query_engine(index, service_context)
# Test query
query = "How does the attention mechanism work in transformers?"
print(f"\nQuery: {query}\n")
response = query_engine.query(query)
print(f"Sentence Window Response:\n{response}\n")
# Show source nodes
print("Source nodes:")
for i, node in enumerate(response.source_nodes, 1):
print(f"\nNode {i} (score: {node.score:.4f}):")
print(f"Original sentence: {node.node.metadata.get('original_text', 'N/A')[:100]}...")
print(f"Window context: {node.node.text[:200]}...")
# Compare with naive chunking
print("\n" + "="*80)
print("COMPARISON: Naive Chunking Approach")
print("="*80)
naive_index, naive_service_context = build_naive_chunking_index(documents)
naive_query_engine = naive_index.as_query_engine(
service_context=naive_service_context,
similarity_top_k=3,
)
naive_response = naive_query_engine.query(query)
print(f"\nNaive Chunking Response:\n{naive_response}\n")
Run the updated script:
python sentence_window_rag.py
What to observe: The naive chunking approach may split the attention mechanism explanation awkwardly depending on where chunk boundaries fall. Response quality varies unpredictably based on how the text gets divided—that's exactly the problem sentence window retrieval solves.
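The intuition behind the comparison doesn't need API calls. Below, a Jaccard word-overlap score (a crude, assumed stand-in for embedding cosine similarity) shows why a small sentence unit matches a query more sharply than the same sentence diluted inside a larger chunk; sentence window retrieval keeps that sharp match, then restores context via the window:

```python
def overlap_score(query, passage):
    """Jaccard word overlap: a toy stand-in for embedding similarity."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)

query = "dot product of query and key vectors"
sentence = "The dot product of query and key vectors determines attention weights."
# The same sentence buried in a larger chunk with unrelated material:
big_chunk = (sentence + " BERT was released by Google in 2018."
             " It used bidirectional training.")

# The bare sentence scores higher than the diluted chunk:
print(overlap_score(query, sentence), overlap_score(query, big_chunk))
```

Real embeddings are far more robust than word overlap, but the dilution effect is the same: the bigger the chunk, the more off-topic tokens drag down its similarity to a focused query.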
Step 6: Add Interactive Query Mode
Make the system interactive for easier testing. Add this function:
def interactive_mode(query_engine):
"""Run interactive query loop."""
print("\n" + "="*80)
print("Interactive Mode - Type 'quit' to exit")
print("="*80 + "\n")
while True:
query = input("Your question: ").strip()
if query.lower() in ['quit', 'exit', 'q']:
break
if not query:
continue
response = query_engine.query(query)
print(f"\nAnswer: {response}\n")
# Optionally show sources
show_sources = input("Show sources? (y/n): ").strip().lower()
if show_sources == 'y':
for i, node in enumerate(response.source_nodes, 1):
print(f"\nSource {i} (relevance: {node.score:.4f}):")
print(node.node.text[:300] + "...")
print()
Update your main block to use interactive mode:
if __name__ == "__main__":
documents = load_documents()
index, service_context = build_sentence_window_index(documents, window_size=3)
query_engine = create_query_engine(index, service_context)
# Run interactive mode
interactive_mode(query_engine)
What this enables: A REPL-style interface for testing different queries and exploring how various questions retrieve different context windows. Try queries like:
- "What year was the transformer architecture introduced?"
- "How does BERT differ from GPT?"
- "What are the three vectors in the attention mechanism?"
Verification
Confirm everything works correctly with these checks:
- Verify index creation: You should see output like "Index built with 13 nodes" where the number roughly matches the sentence count in your document.
- Test a specific query: Ask "What year was the transformer architecture introduced?" - the response should include "2017" with surrounding context about the innovation.
- Inspect source nodes: When you show sources, the "Window context" should be noticeably longer (typically 5-7 sentences) than the "Original sentence" (1 sentence).
- Check API calls: You should see network activity indicating OpenAI API calls for both embeddings (during indexing) and completions (during queries).
- Test window expansion: Run this verification query in interactive mode:
query = "What are query, key, and value vectors?"
The response should include context from surrounding sentences about how these vectors are used, not just their definition.
Success indicators: Coherent, contextual answers that naturally reference information from multiple sentences, with source nodes showing expanded context windows.
Common Issues & Fixes
Issue 1: NLTK punkt tokenizer error
Note: If you see Resource punkt_tab not found instead, your NLTK version is newer than the pinned release; download the extra resource with nltk.download('punkt_tab'). On the pinned version, only punkt is needed for sentence tokenization.
Error message:
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
**********************************************************************
Fix: Manually download NLTK data:
python -c "import nltk; nltk.download('punkt')"
If that fails due to SSL issues (common on corporate networks), specify a download directory:
python -c "import nltk; nltk.download('punkt', download_dir='~/nltk_data')"
Then set the NLTK data path in your script before importing:
import nltk
nltk.data.path.append('/path/to/nltk_data')
Issue 2: Index built with 0 nodes
Symptoms: Output shows "Index built with 0 nodes" and queries return no results.
Fix: Verify your documents loaded correctly:
ls -la data/
# Should show sample.txt with non-zero size
Check file encoding:
file data/sample.txt
# Should show: ASCII text or UTF-8 Unicode text
If the file exists but isn't loading, add debug output:
def load_documents(data_dir="data"):
documents = []
print(f"Looking for files in: {os.path.abspath(data_dir)}")
for filename in os.listdir(data_dir):
print(f"Found file: {filename}")
if filename.endswith('.txt'):
# ... rest of function
Issue 3: OpenAI rate limit errors
Error message:
openai.RateLimitError: You exceeded your current quota, please check your plan and billing details.
Fix option 1: This particular message usually indicates a billing/quota problem rather than request throttling, so check your usage dashboard first. The tutorial already uses the cheaper GPT-3.5-turbo by default. To upgrade to GPT-4 once quota allows:
llm = OpenAI(model="gpt-4", temperature=0.1) # Higher quality, ~20x cost
Cost Reference:
- GPT-3.5-turbo: ~$0.0015/1K tokens (tutorial default)
- GPT-4: ~$0.03/1K tokens (~20x more expensive)
- Embedding (ada-002): ~$0.0001/1K tokens
- Total tutorial cost: ~$0.01-0.02 for indexing the sample document
Fix option 2: Add retry logic with exponential backoff:
from tenacity import retry, wait_exponential, stop_after_attempt
@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(3))
def query_with_retry(query_engine, query):
return query_engine.query(query)
# Use it like:
response = query_with_retry(query_engine, "Your question here")
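If you'd rather not add the tenacity dependency, a hand-rolled backoff loop does the same job. This is a sketch with assumed parameters mirroring the tenacity settings above; in real use, catch openai.RateLimitError specifically rather than bare Exception:

```python
import time

def query_with_backoff(fn, *args, retries=3, base_delay=4.0,
                       max_delay=60.0, sleep=time.sleep):
    """Retry fn(*args) on exceptions, doubling the wait between attempts."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            sleep(min(delay, max_delay))
            delay *= 2

# Usage: response = query_with_backoff(query_engine.query, "Your question here")
```

The sleep function is injectable so you can unit-test the retry logic without real waits.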
Issue 4: ImportError for llama_index modules
Note: If the documentation or examples you're reading show .core import paths (e.g., llama_index.core.node_parser), they target llama-index v0.10+, which reorganized the package layout. This tutorial pins 0.9.48.
Always check the version:
pip show llama-index
Error message:
ImportError: cannot import name 'SentenceWindowNodeParser' from 'llama_index.core.node_parser'
Fix: Ensure you're using the correct llama-index version:
pip show llama-index
# Should show version 0.9.48
If the version is different, reinstall:
pip uninstall llama-index
pip install llama-index==0.9.48
Next Steps
You now have a working sentence window retrieval system. Here's how to extend it:
When NOT to Use Sentence Window Retrieval
Sentence window retrieval isn't always the best choice. Consider alternatives when:
- Documents are very short: If your docs are under 500 tokens, sentence-level splitting adds unnecessary complexity. Use simple chunking instead.
- Content is code-heavy: Code doesn't follow natural sentence boundaries. Punctuation in strings, comments, and syntax can confuse sentence tokenizers.
- Working with non-English text: NLTK's punkt tokenizer is English-optimized. Chinese, Japanese, Arabic, and other languages may get poor sentence boundaries.
- Content is structured data: Tables, lists, and JSON don't have "sentences" in the traditional sense. Consider parent document retrieval or auto-merging retrieval instead.
- Latency is critical: The extra post-processing step adds ~50-100ms per query. For ultra-low-latency requirements, use simpler retrieval.
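To act on the code-heavy caveat programmatically, you could pre-screen documents before picking a splitter. looks_code_heavy is a rough, hypothetical heuristic with an arbitrary threshold, not a library function:

```python
def looks_code_heavy(text, threshold=0.05):
    """Rough heuristic: a high density of code-ish punctuation suggests a
    sentence tokenizer will produce poor boundaries. Threshold is arbitrary."""
    if not text:
        return False
    code_chars = sum(text.count(c) for c in "{}();=<>[]")
    return code_chars / len(text) > threshold
```

Documents that trip the check are better served by fixed-size or syntax-aware chunking than by sentence windows.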
Immediate Improvements
- Tune the window size: Experiment with different values:
# More context (7 sentences total)
build_sentence_window_index(documents, window_size=3)
# Less context but more precision (3 sentences total)
build_sentence_window_index(documents, window_size=1)
# Maximum context (11 sentences total)
build_sentence_window_index(documents, window_size=5)
Window size 3 works well for technical documentation, but adjust based on your content.
- Add document metadata filtering: Extend metadata to filter at query time:
doc = Document(
    text=text,
    metadata={
        "filename": filename,
        "doc_type": "technical",
        "date": "2024-01-15",
    }
)
# Then filter during queries
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="doc_type", value="technical")]
)
query_engine = index.as_query_engine(filters=filters)
- Implement hybrid search: Combine vector similarity with BM25 keyword search (requires pip install llama-index-retrievers-bm25):
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.retrievers import QueryFusionRetriever

vector_retriever = index.as_retriever(similarity_top_k=3)
# BM25 uses the docstore for keyword-based retrieval
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=3,
)
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=3,
)
Production Enhancements
- Persist the index: Save to disk so you don't rebuild (and re-embed) on every run:
# Save index to disk
index.storage_context.persist(persist_dir="./storage")

# Load index from disk later
from llama_index import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
This saves embedding API costs and startup time on subsequent runs.